Load some libraries. Mostly the same as in original script except for:
library("ggplot2")
library("kableExtra") # This is just for rendering this html.
library("tinytex")
library("dada2")
library("ShortRead")
library("Biostrings")
library("data.table")
library("biomformat")
library("DECIPHER")
library("phyloseq")
library("msa")
library("kmer")
library("vegan")
library("compositions")
library("htmltools")
library("foreach")
library("doParallel")
First I generate two objects that will harbour all intermediate steps and auxiliary variables for each dataset. I also generate a list for all custom functions.
setwd("/home/rik/fukushima/metabarcoding.git")
XITS <- list()
X16S <- list()
FUNS <- list()
source("functions.R")
Read in the data paths. Obviously adapted to my computer.
XITS$path <-"~/fukushima/metabarcoding.git/HN00198716/ITS"
X16S$path <-"~/fukushima/metabarcoding.git/HN00198716/16S"
XITS$fnFs <- sort(list.files(XITS$path, pattern = "_1.fq.gz", full.names = TRUE))
length(XITS$fnFs)
XITS$fnRs <- sort(list.files(XITS$path, pattern = "_2.fq.gz", full.names = TRUE))
length(XITS$fnRs)
X16S$fnFs <- sort(list.files(X16S$path, pattern = "_1.fq.gz", full.names = TRUE))
length(X16S$fnFs)
X16S$fnRs <- sort(list.files(X16S$path, pattern = "_2.fq.gz", full.names = TRUE))
length(X16S$fnRs)
# Extract all sample names:
XITS$sample.names <- unname(sapply(XITS$fnFs, FUNS$get.sample.name))
XITS$sample.names[XITS$sample.names == "mock"] <- c("mockITS","mockITS1")
X16S$sample.names <- unname(sapply(X16S$fnFs, FUNS$get.sample.name))
X16S$sample.names[X16S$sample.names == "mock"] <- c("mock16S","mock16S1")
sample.names <- unique(c(XITS$sample.names,X16S$sample.names))
META <- read.table("~/fukushima/metabarcoding.git/metadata.csv",sep=",",dec=".",header=T)[,1:7]
rownames(META) <- META$library
# Here I define 5 colors that we will use throughout for site, used for the levels
# "Arakawa", "Futaba", "Okuma", "Tsushima" and "University"
# I will relevel META to put them in a more logical order as well.
META$site <- factor(META$site,levels=c("University", "Arakawa", "Tsushima", "Okuma", "Futaba"))
sitecolors <- c("#ADD8E6","#2E8B57","#FF7F50","#FF91A4","#800020")
XITS$META <- META[match(XITS$sample.names, META$library), ]
X16S$META <- META[match(X16S$sample.names, META$library), ]
Here I advise to plot quality profiles of the same run, FW and RV next to each other. I also analyzed all files with fastQC, and compiled them in a summary report using multiQC. All these summary reports show more or less the same statistics, so it’s up to personal preference which one you prefer to use. But you should actually go through a round of thorough checking for all libraries.
# This is the loop that you can use to generate all 46 quality profiles:
#for (i in 1:46){plotQualityProfile(c(fnFs[i],fnRs[i]))}
# Example:
plotQualityProfile(c(XITS$fnFs[16],XITS$fnRs[16]))
plotQualityProfile(c(X16S$fnFs[16],X16S$fnRs[16]))
In all cases, >99% of sequences are retained. Question: when were the PhiX reads filtered out?
Primers are present and already confirmed by fastQC. At first sight, there seems to be no need to turn them in all directions, but I agree it’s always better safe than sorry. So we do check for primers in all orientations.
XITS$FWD <- "GTGARTCATCGARTCTTTGAA"
XITS$REV <- "TCCTCCGCTTATTGATATGC"
X16S$FWD <- "CCTACGGGNGGCWGCAG"
X16S$REV <- "GACTACHVGGGTATCTAATCC"
XITS$FWD.orients <- FUNS$allOrients(XITS$FWD)
XITS$REV.orients <- FUNS$allOrients(XITS$REV)
X16S$FWD.orients <- FUNS$allOrients(X16S$FWD)
X16S$REV.orients <- FUNS$allOrients(X16S$REV)
XITS$fnFs.filtN <- file.path(XITS$path, "filtN", basename(XITS$fnFs))
XITS$fnRs.filtN <- file.path(XITS$path, "filtN", basename(XITS$fnRs))
X16S$fnFs.filtN <- file.path(X16S$path, "filtN", basename(X16S$fnFs))
X16S$fnRs.filtN <- file.path(X16S$path, "filtN", basename(X16S$fnRs))
# Our first filtering step using the filterAndTrim function is really just getting rid of sequences containing N nucleotides, and PhiX.
# About the multithread parameter: only use this if your computer allows parallel processing.
XITS$out <- filterAndTrim(XITS$fnFs, XITS$fnFs.filtN, XITS$fnRs, XITS$fnRs.filtN, maxN = 0, matchIDs=T, rm.phix=T, multithread = 12)
X16S$out <- filterAndTrim(X16S$fnFs, X16S$fnFs.filtN, X16S$fnRs, X16S$fnRs.filtN, maxN = 0, matchIDs=T, rm.phix=T, multithread = 12)
# For ITS:
XITS$primersummary <- FUNS$generate_primer_summary(XITS, "fnFs.filtN", "fnRs.filtN", multithread = 12)
# For 16S:
X16S$primersummary <- FUNS$generate_primer_summary(X16S, "fnFs.filtN", "fnRs.filtN", multithread = 12)
| FW.fw.Forward | FW.fw.Complement | FW.fw.Reverse | FW.fw.RevComp | FW.rv.Forward | FW.rv.Complement | FW.rv.Reverse | FW.rv.RevComp | RV.fw.Forward | RV.fw.Complement | RV.fw.Reverse | RV.fw.RevComp | RV.rv.Forward | RV.rv.Complement | RV.rv.Reverse | RV.rv.RevComp | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| A1T | 77813 | 0 | 0 | 0 | 0 | 0 | 0 | 13350 | 0 | 0 | 0 | 19358 | 75396 | 0 | 0 | 0 |
| A2T | 73765 | 0 | 0 | 0 | 0 | 0 | 0 | 15852 | 0 | 0 | 0 | 23964 | 71090 | 0 | 0 | 0 |
| A2T1 | 80687 | 0 | 0 | 0 | 0 | 0 | 0 | 8948 | 0 | 0 | 0 | 13140 | 77918 | 0 | 0 | 0 |
| A3T | 79093 | 0 | 0 | 0 | 0 | 0 | 0 | 7598 | 0 | 0 | 0 | 12367 | 76578 | 0 | 0 | 0 |
| A3T1 | 66472 | 0 | 0 | 0 | 0 | 0 | 0 | 9514 | 0 | 0 | 0 | 14092 | 64164 | 0 | 0 | 0 |
| A4T | 78599 | 0 | 0 | 0 | 0 | 0 | 0 | 5839 | 0 | 0 | 0 | 8362 | 75914 | 0 | 0 | 0 |
| A4T1 | 92029 | 0 | 0 | 0 | 0 | 0 | 0 | 4872 | 0 | 0 | 0 | 6970 | 89315 | 0 | 0 | 0 |
| A6T | 77947 | 0 | 0 | 0 | 0 | 0 | 0 | 3989 | 0 | 0 | 0 | 6064 | 75705 | 0 | 0 | 0 |
| A6T1 | 93955 | 0 | 0 | 0 | 0 | 0 | 0 | 3771 | 0 | 0 | 0 | 5585 | 90905 | 0 | 0 | 0 |
| A7T | 86119 | 0 | 0 | 0 | 0 | 0 | 0 | 9994 | 0 | 0 | 0 | 13895 | 83486 | 0 | 0 | 0 |
| A8T | 77877 | 0 | 0 | 0 | 0 | 0 | 0 | 6869 | 0 | 0 | 0 | 10560 | 75306 | 0 | 0 | 0 |
| A9T | 79231 | 0 | 0 | 0 | 1 | 0 | 0 | 8442 | 1 | 0 | 0 | 12425 | 76528 | 0 | 0 | 0 |
| C1T | 64535 | 0 | 0 | 0 | 0 | 0 | 0 | 1427 | 0 | 0 | 0 | 2159 | 62013 | 0 | 0 | 0 |
| C2T | 55851 | 0 | 0 | 0 | 0 | 0 | 0 | 1210 | 0 | 0 | 0 | 2572 | 53699 | 0 | 0 | 0 |
| C3T | 82673 | 0 | 0 | 0 | 0 | 0 | 0 | 1393 | 0 | 0 | 0 | 3698 | 79537 | 0 | 0 | 0 |
| C5T | 78502 | 0 | 0 | 0 | 0 | 0 | 0 | 391 | 0 | 0 | 0 | 669 | 75487 | 0 | 0 | 0 |
| C6T | 67432 | 0 | 0 | 0 | 0 | 0 | 0 | 467 | 0 | 0 | 0 | 943 | 65158 | 0 | 0 | 0 |
| F1T | 74167 | 0 | 0 | 0 | 0 | 0 | 0 | 2794 | 0 | 0 | 0 | 5249 | 71791 | 0 | 0 | 0 |
| F3T | 73513 | 0 | 0 | 0 | 0 | 0 | 0 | 4541 | 0 | 0 | 0 | 8099 | 71079 | 0 | 0 | 0 |
| FU1T | 84775 | 0 | 0 | 0 | 0 | 0 | 0 | 2505 | 0 | 0 | 0 | 4442 | 82113 | 0 | 0 | 0 |
| FU2T | 67011 | 0 | 0 | 0 | 0 | 0 | 0 | 2047 | 0 | 0 | 0 | 4614 | 64837 | 0 | 0 | 0 |
| FU3T | 78759 | 0 | 0 | 0 | 0 | 0 | 0 | 6105 | 0 | 0 | 0 | 9842 | 76029 | 0 | 0 | 0 |
| FU3T1 | 76464 | 0 | 0 | 0 | 0 | 0 | 0 | 4650 | 0 | 0 | 0 | 8789 | 74295 | 0 | 0 | 0 |
| HT | 71981 | 0 | 0 | 0 | 0 | 0 | 0 | 2012 | 0 | 0 | 0 | 3122 | 69441 | 0 | 0 | 0 |
| mockITS | 82685 | 0 | 0 | 1 | 1 | 0 | 0 | 17954 | 1 | 0 | 0 | 33689 | 79930 | 0 | 0 | 0 |
| mockITS1 | 78942 | 0 | 0 | 1 | 1 | 0 | 0 | 16111 | 0 | 0 | 0 | 31473 | 76322 | 0 | 0 | 0 |
| O10T | 90383 | 0 | 0 | 0 | 0 | 0 | 0 | 5890 | 0 | 0 | 0 | 11378 | 87120 | 0 | 0 | 0 |
| O11T | 49393 | 0 | 0 | 0 | 0 | 0 | 0 | 1248 | 0 | 0 | 0 | 2028 | 47526 | 0 | 0 | 0 |
| O12T | 82557 | 0 | 0 | 0 | 0 | 0 | 0 | 3095 | 0 | 0 | 0 | 5599 | 79568 | 0 | 0 | 0 |
| O2T | 86638 | 0 | 0 | 0 | 0 | 0 | 0 | 6116 | 0 | 0 | 0 | 12092 | 83795 | 0 | 0 | 0 |
| O3T | 89327 | 0 | 0 | 0 | 0 | 0 | 0 | 10358 | 0 | 0 | 0 | 18191 | 85950 | 0 | 0 | 0 |
| O4T | 78314 | 0 | 0 | 0 | 0 | 0 | 0 | 3269 | 0 | 0 | 0 | 5680 | 75476 | 0 | 0 | 0 |
| O5T | 82614 | 0 | 0 | 0 | 0 | 0 | 0 | 7039 | 0 | 0 | 0 | 11972 | 79701 | 0 | 0 | 0 |
| O6T | 77044 | 0 | 0 | 0 | 0 | 0 | 0 | 3715 | 0 | 0 | 0 | 6935 | 74449 | 0 | 0 | 0 |
| O8T | 80788 | 0 | 0 | 0 | 0 | 0 | 0 | 5730 | 0 | 0 | 0 | 11518 | 79342 | 0 | 0 | 0 |
| O9T | 70605 | 0 | 0 | 0 | 0 | 0 | 0 | 4127 | 0 | 0 | 0 | 6807 | 67977 | 0 | 0 | 0 |
| O9T1 | 86912 | 0 | 0 | 1 | 1 | 0 | 0 | 5731 | 1 | 0 | 0 | 10308 | 83948 | 0 | 0 | 0 |
| T10T | 51784 | 0 | 0 | 0 | 0 | 0 | 0 | 8305 | 0 | 0 | 0 | 14446 | 49815 | 0 | 0 | 0 |
| T10T1 | 92874 | 0 | 0 | 0 | 1 | 0 | 0 | 14787 | 1 | 0 | 0 | 25950 | 89680 | 0 | 0 | 0 |
| T11T | 74391 | 0 | 0 | 0 | 0 | 0 | 0 | 1949 | 0 | 0 | 0 | 4664 | 71800 | 0 | 0 | 0 |
| T12T | 86740 | 0 | 0 | 0 | 0 | 0 | 0 | 22301 | 0 | 0 | 0 | 32543 | 83731 | 0 | 0 | 0 |
| T12T1 | 92359 | 0 | 0 | 0 | 0 | 0 | 0 | 13065 | 0 | 0 | 0 | 21035 | 89499 | 0 | 0 | 0 |
| T5T | 80626 | 0 | 0 | 0 | 0 | 0 | 0 | 6055 | 0 | 0 | 0 | 9938 | 77903 | 0 | 0 | 0 |
| T9T | 84065 | 0 | 0 | 0 | 0 | 0 | 0 | 2688 | 0 | 0 | 0 | 3891 | 81492 | 0 | 0 | 0 |
| T9T1 | 91939 | 0 | 0 | 0 | 0 | 0 | 0 | 5586 | 0 | 0 | 0 | 9799 | 88958 | 0 | 0 | 0 |
| U3T | 64310 | 0 | 0 | 0 | 1 | 0 | 0 | 1325 | 1 | 0 | 0 | 2716 | 62055 | 0 | 0 | 0 |
| FW.fw.Forward | FW.fw.Complement | FW.fw.Reverse | FW.fw.RevComp | FW.rv.Forward | FW.rv.Complement | FW.rv.Reverse | FW.rv.RevComp | RV.fw.Forward | RV.fw.Complement | RV.fw.Reverse | RV.fw.RevComp | RV.rv.Forward | RV.rv.Complement | RV.rv.Reverse | RV.rv.RevComp | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| A1T | 63653 | 0 | 0 | 0 | 0 | 0 | 0 | 4 | 0 | 0 | 0 | 16 | 62935 | 0 | 0 | 0 |
| A2T | 64125 | 0 | 0 | 0 | 1 | 0 | 0 | 10 | 1 | 0 | 0 | 13 | 63342 | 0 | 0 | 0 |
| A2T1 | 83939 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2 | 82776 | 0 | 0 | 0 |
| A3T | 64353 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 6 | 63473 | 0 | 0 | 0 |
| A3T1 | 64896 | 0 | 0 | 0 | 3 | 0 | 0 | 2 | 3 | 0 | 0 | 7 | 64144 | 0 | 0 | 0 |
| A4T | 47381 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 3 | 46953 | 0 | 0 | 0 |
| A4T1 | 60805 | 0 | 0 | 0 | 1 | 0 | 0 | 5 | 1 | 0 | 0 | 19 | 61391 | 0 | 0 | 0 |
| A5T | 78884 | 0 | 0 | 0 | 2 | 0 | 0 | 203 | 2 | 0 | 0 | 1569 | 81228 | 0 | 0 | 0 |
| A6T | 96701 | 0 | 0 | 0 | 0 | 0 | 0 | 2 | 0 | 0 | 0 | 6 | 96089 | 0 | 0 | 0 |
| A6T1 | 46396 | 0 | 0 | 0 | 0 | 0 | 0 | 353 | 0 | 0 | 0 | 1378 | 46903 | 0 | 0 | 0 |
| A7T | 61965 | 0 | 0 | 0 | 0 | 0 | 0 | 6 | 0 | 0 | 0 | 17 | 61111 | 0 | 0 | 0 |
| A8T | 65411 | 0 | 0 | 0 | 3 | 0 | 0 | 12 | 1 | 0 | 0 | 83 | 64690 | 0 | 0 | 0 |
| A9T | 67579 | 0 | 0 | 0 | 3 | 0 | 0 | 3 | 4 | 0 | 0 | 17 | 66986 | 0 | 0 | 0 |
| C1T | 64170 | 0 | 0 | 0 | 1 | 0 | 0 | 2 | 0 | 0 | 0 | 6 | 63546 | 0 | 0 | 0 |
| C2T | 52220 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 3 | 51441 | 0 | 0 | 0 |
| C3T | 60544 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 7 | 59731 | 0 | 0 | 0 |
| C4T | 64441 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 6 | 62853 | 0 | 0 | 0 |
| C5T | 56084 | 0 | 0 | 0 | 2 | 0 | 0 | 3 | 2 | 0 | 0 | 6 | 55094 | 0 | 0 | 0 |
| C6T | 60380 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 0 | 0 | 0 | 4 | 59438 | 0 | 0 | 0 |
| F1T | 86897 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 10 | 85657 | 0 | 0 | 0 |
| F3T | 62047 | 0 | 0 | 0 | 0 | 0 | 0 | 2 | 0 | 0 | 0 | 3 | 61334 | 0 | 0 | 0 |
| FU1T | 58131 | 0 | 0 | 0 | 1 | 0 | 0 | 3 | 1 | 0 | 0 | 9 | 57621 | 0 | 0 | 0 |
| FU2T | 55921 | 0 | 0 | 0 | 2 | 0 | 0 | 3 | 2 | 0 | 0 | 8 | 55135 | 0 | 0 | 0 |
| FU3T | 56426 | 0 | 0 | 0 | 2 | 0 | 0 | 2 | 2 | 0 | 0 | 3 | 55624 | 0 | 0 | 0 |
| FU3T1 | 63504 | 0 | 0 | 0 | 3 | 0 | 0 | 2 | 2 | 0 | 0 | 7 | 62943 | 0 | 0 | 0 |
| HT | 53937 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 53360 | 0 | 0 | 0 |
| mock16S | 52943 | 0 | 0 | 0 | 4 | 0 | 0 | 0 | 4 | 0 | 0 | 3 | 51888 | 0 | 0 | 0 |
| mock16S1 | 44966 | 0 | 0 | 0 | 3 | 0 | 0 | 0 | 3 | 0 | 0 | 5 | 44047 | 0 | 0 | 0 |
| O10T | 62655 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 4 | 61474 | 0 | 0 | 0 |
| O11T | 61037 | 0 | 0 | 0 | 2 | 0 | 0 | 4 | 2 | 0 | 0 | 4 | 59977 | 0 | 0 | 0 |
| O12T | 56698 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 6 | 55962 | 0 | 0 | 0 |
| O1T | 57881 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 5 | 56938 | 0 | 0 | 0 |
| O2T | 46587 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 3 | 45578 | 0 | 0 | 0 |
| O3T | 63209 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 0 | 0 | 0 | 10 | 62312 | 0 | 0 | 0 |
| O4T | 66018 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 2 | 0 | 0 | 1 | 65185 | 0 | 0 | 0 |
| O5T | 66160 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 2 | 65345 | 0 | 0 | 0 |
| O6T | 52017 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 2 | 51475 | 0 | 0 | 0 |
| O7T | 81704 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 0 | 0 | 0 | 17 | 81564 | 0 | 0 | 0 |
| O8T | 59378 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 5 | 58426 | 0 | 0 | 0 |
| O9T | 55057 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 4 | 54122 | 0 | 0 | 0 |
| O9T1 | 55066 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 54575 | 0 | 0 | 0 |
| T10T | 62680 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 62262 | 0 | 0 | 0 |
| T10T1 | 67617 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 10 | 67569 | 0 | 0 | 0 |
| T11T | 50304 | 0 | 0 | 0 | 0 | 0 | 0 | 2 | 0 | 0 | 0 | 6 | 49669 | 0 | 0 | 0 |
| T11T1 | 58175 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 56927 | 0 | 0 | 0 |
| T12T | 55569 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 54907 | 0 | 0 | 0 |
| T12T1 | 85881 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 5 | 85123 | 0 | 0 | 0 |
| T1T | 78272 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 77752 | 0 | 0 | 0 |
| T3T | 90936 | 0 | 0 | 0 | 0 | 0 | 0 | 424 | 0 | 0 | 0 | 1336 | 92033 | 0 | 0 | 0 |
| T5T | 64045 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 4 | 63362 | 0 | 0 | 0 |
| T9T | 60004 | 0 | 0 | 0 | 0 | 0 | 0 | 6 | 0 | 0 | 0 | 35 | 60612 | 0 | 0 | 0 |
| T9T1 | 57229 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 56654 | 0 | 0 | 0 |
| U3T | 55430 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 54317 | 0 | 0 | 0 |
In this table, you can sporadically spot the odd badly oriented primer, but it’s such a small fraction of the dataset (e.g. in A9T) , that I would not bother trying to filter those out. Readthrough however, seems to be pretty common. Here I calculate the percentage readthrough for all libraries:
# For ITS
XITS$proportion_revcomp_fw <- XITS$primersummary[,8]/XITS$primersummary[,1]
XITS$proportion_revcomp_rv <- XITS$primersummary[,12]/XITS$primersummary[,13]
# For 16S
X16S$proportion_revcomp_fw <- X16S$primersummary[,8]/X16S$primersummary[,1]
X16S$proportion_revcomp_rv <- X16S$primersummary[,12]/X16S$primersummary[,13]
plot(XITS$proportion_revcomp_fw,XITS$proportion_revcomp_rv,log="xy",col="white",cex=0.75, main = "ITS: proportion of reads with readthrough of the entire primer")
text(XITS$proportion_revcomp_fw,XITS$proportion_revcomp_rv,labels=sample.names,cex=0.75)
plot(X16S$proportion_revcomp_fw + 1e-06 ,X16S$proportion_revcomp_rv + 1e-06 , log="xy",col="white",cex=0.75, main = "16S: proportion of reads with readthrough of the entire primer")
text(X16S$proportion_revcomp_fw + 1e-06 ,X16S$proportion_revcomp_rv + 1e-06 , labels=sample.names,cex=0.75)
There is virtually no readthrough for the 16S samples, and when there is some, it’s correlated between forward and reverse. I had to add a small quantity to the 16S counts in order to correct for zeros.
For ITS, in the forward libraries, the proportion ranges from 0.5% to 25%. In the reverse libraries, it’s between 0.9% and 42%. The plot shows that forward and reverse are correlated, as expected.
Samples C5T and C6T are clearly outliers in this respect, with very few detected readthroughs.
What we can tentatively conclude for ITS at this point:
Next step is cutadapt. I installed this on my local machine, and ran it via a system call:
cutadapt <- "/usr/bin/cutadapt"
system2(cutadapt, args = "--version")
# Generate a subdirectory that will contain the cutadapt trimmed files
XITS$path.cut <- file.path(XITS$path, "cutadapt"); if(!dir.exists(XITS$path.cut)) dir.create(XITS$path.cut)
X16S$path.cut <- file.path(X16S$path, "cutadapt"); if(!dir.exists(X16S$path.cut)) dir.create(X16S$path.cut)
XITS$fnFs.cut <- file.path(XITS$path.cut, basename(XITS$fnFs))
XITS$fnRs.cut <- file.path(XITS$path.cut, basename(XITS$fnRs))
X16S$fnFs.cut <- file.path(X16S$path.cut, basename(X16S$fnFs))
X16S$fnRs.cut <- file.path(X16S$path.cut, basename(X16S$fnRs))
XITS$FWD.RC <- dada2:::rc(XITS$FWD)
XITS$REV.RC <- dada2:::rc(XITS$REV)
X16S$FWD.RC <- dada2:::rc(X16S$FWD)
X16S$REV.RC <- dada2:::rc(X16S$REV)
# Trim FWD and the reverse-complement of REV off of R1 (forward reads)
XITS$R1.flags <- paste("-g", XITS$FWD, "-a", XITS$REV.RC)
# Trim REV and the reverse-complement of FWD off of R2 (reverse reads)
XITS$R2.flags <- paste("-G", XITS$REV, "-A", XITS$FWD.RC)
# Trim FWD and the reverse-complement of REV off of R1 (forward reads)
X16S$R1.flags <- paste("-g", X16S$FWD, "-a", X16S$REV.RC)
# Trim REV and the reverse-complement of FWD off of R2 (reverse reads)
X16S$R2.flags <- paste("-G", X16S$REV, "-A", X16S$FWD.RC)
# Run Cutadapt for ITS
for(i in seq_along(XITS$fnFs)) {
system2("cutadapt", args = c(XITS$R1.flags, XITS$R2.flags, "-n", 2, # -n 2 required to remove FWD and REV from reads
"-o", XITS$fnFs.cut[i], "-p", XITS$fnRs.cut[i], # output files
XITS$fnFs.filtN[i], XITS$fnRs.filtN[i])) # input files
}
# Run Cutadapt for 16S
for(i in seq_along(X16S$fnFs)) {
system2("cutadapt", args = c(X16S$R1.flags, X16S$R2.flags, "-n", 2, # -n 2 required to remove FWD and REV from reads
"-o", X16S$fnFs.cut[i], "-p", X16S$fnRs.cut[i], # output files
X16S$fnFs.filtN[i], X16S$fnRs.filtN[i])) # input files
}
# For ITS:
XITS$primersummary.2 <- FUNS$generate_primer_summary(XITS, "fnFs.cut", "fnRs.cut")
# For 16S:
X16S$primersummary.2 <- FUNS$generate_primer_summary(X16S, "fnFs.cut", "fnRs.cut")
| FW.fw.Forward | FW.fw.Complement | FW.fw.Reverse | FW.fw.RevComp | FW.rv.Forward | FW.rv.Complement | FW.rv.Reverse | FW.rv.RevComp | RV.fw.Forward | RV.fw.Complement | RV.fw.Reverse | RV.fw.RevComp | RV.rv.Forward | RV.rv.Complement | RV.rv.Reverse | RV.rv.RevComp | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| A1T | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| A2T | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| A2T1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| A3T | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| A3T1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| A4T | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| A4T1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| A6T | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| A6T1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| A7T | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| A8T | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| A9T | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| C1T | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| C2T | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| C3T | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| C5T | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| C6T | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| F1T | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| F3T | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| FU1T | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| FU2T | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| FU3T | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| FU3T1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| HT | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| mockITS | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| mockITS1 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| O10T | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| O11T | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| O12T | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| O2T | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| O3T | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| O4T | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| O5T | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| O6T | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| O8T | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| O9T | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| O9T1 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| T10T | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| T10T1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| T11T | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| T12T | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| T12T1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| T5T | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| T9T | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| T9T1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| U3T | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| FW.fw.Forward | FW.fw.Complement | FW.fw.Reverse | FW.fw.RevComp | FW.rv.Forward | FW.rv.Complement | FW.rv.Reverse | FW.rv.RevComp | RV.fw.Forward | RV.fw.Complement | RV.fw.Reverse | RV.fw.RevComp | RV.rv.Forward | RV.rv.Complement | RV.rv.Reverse | RV.rv.RevComp | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| A1T | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| A2T | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| A2T1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| A3T | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| A3T1 | 0 | 0 | 0 | 0 | 3 | 0 | 0 | 0 | 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| A4T | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| A4T1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| A5T | 0 | 0 | 0 | 0 | 2 | 0 | 0 | 0 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| A6T | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| A6T1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| A7T | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| A8T | 0 | 0 | 0 | 0 | 3 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| A9T | 0 | 0 | 0 | 0 | 3 | 0 | 0 | 0 | 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| C1T | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| C2T | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| C3T | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| C4T | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| C5T | 0 | 0 | 0 | 0 | 2 | 0 | 0 | 0 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| C6T | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| F1T | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| F3T | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| FU1T | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| FU2T | 0 | 0 | 0 | 0 | 2 | 0 | 0 | 0 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| FU3T | 0 | 0 | 0 | 0 | 2 | 0 | 0 | 0 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| FU3T1 | 0 | 0 | 0 | 0 | 3 | 0 | 0 | 0 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| HT | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| mock16S | 0 | 0 | 0 | 0 | 4 | 0 | 0 | 0 | 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| mock16S1 | 0 | 0 | 0 | 0 | 3 | 0 | 0 | 0 | 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| O10T | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| O11T | 0 | 0 | 0 | 0 | 2 | 0 | 0 | 0 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| O12T | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| O1T | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| O2T | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| O3T | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| O4T | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| O5T | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| O6T | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| O7T | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| O8T | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| O9T | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| O9T1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| T10T | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| T10T1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| T11T | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| T11T1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| T12T | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| T12T1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| T1T | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| T3T | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| T5T | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| T9T | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| T9T1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| U3T | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
Now you can see the primers have gone, except for the sporadic badly oriented primer.
Trimmomatic has multiple important parameters:
This part has taken me quite some optimization. When we go too stringent, the reads end up too small. At first sight this is not a problem, but then afterwards when pairs have to get merged, they don’t have sufficient overlap anymore. So the parameters I end up using here, are stringent enough to generally exclude shitty base calls, but leave reads of sufficient length for merging. If afterwards it would turn out we get rid of samples C5T and U3T, we can crank up the stringency.
# Forward and reverse fastq filenames have the format:
XITS$cutFs <- sort(list.files(XITS$path.cut, pattern = "_1.fq.gz", full.names = TRUE))
XITS$cutRs <- sort(list.files(XITS$path.cut, pattern = "_2.fq.gz", full.names = TRUE))
dir.create(file.path(XITS$path.cut, "trim_paired"))
dir.create(file.path(XITS$path.cut, "trim_unpaired"))
XITS$trimFs_paired <- file.path(XITS$path.cut, "trim_paired", basename(XITS$cutFs))
XITS$trimFs_unpaired <- file.path(XITS$path.cut, "trim_unpaired", basename(XITS$cutFs))
XITS$trimRs_paired <- file.path(XITS$path.cut, "trim_paired", basename(XITS$cutRs))
XITS$trimRs_unpaired <- file.path(XITS$path.cut, "trim_unpaired", basename(XITS$cutRs))
for(i in 1:length(sample.names))
{
command <- paste(c("java -jar ~/Trimmomatic-0.39/trimmomatic-0.39.jar PE",
XITS$cutFs[i],
XITS$cutRs[i],
XITS$trimFs_paired[i],
XITS$trimFs_unpaired[i],
XITS$trimRs_paired[i],
XITS$trimRs_unpaired[i],
"LEADING:0 TRAILING:25 SLIDINGWINDOW:4:25 MINLEN:100"),collapse=" ")
system(command) # This is actually the call to execute the Trimmomatic call outlined above
}
# Forward and reverse fastq filenames have the format:
X16S$cutFs <- sort(list.files(X16S$path.cut, pattern = "_1.fq.gz", full.names = TRUE))
X16S$cutRs <- sort(list.files(X16S$path.cut, pattern = "_2.fq.gz", full.names = TRUE))
dir.create(file.path(X16S$path.cut, "trim_paired"))
dir.create(file.path(X16S$path.cut, "trim_unpaired"))
X16S$trimFs_paired <- file.path(X16S$path.cut, "trim_paired", basename(X16S$cutFs))
X16S$trimFs_unpaired <- file.path(X16S$path.cut, "trim_unpaired", basename(X16S$cutFs))
X16S$trimRs_paired <- file.path(X16S$path.cut, "trim_paired", basename(X16S$cutRs))
X16S$trimRs_unpaired <- file.path(X16S$path.cut, "trim_unpaired", basename(X16S$cutRs))
for(i in 1:length(sample.names))
{
command <- paste(c("java -jar ~/Trimmomatic-0.39/trimmomatic-0.39.jar PE",
X16S$cutFs[i],
X16S$cutRs[i],
X16S$trimFs_paired[i],
X16S$trimFs_unpaired[i],
X16S$trimRs_paired[i],
X16S$trimRs_unpaired[i],
"LEADING:0 TRAILING:25 SLIDINGWINDOW:4:25 MINLEN:100"),collapse=" ")
system(command) # This is actually the call to execute the Trimmomatic call outlined above
}
Also here, you should go through all of them.
plotQualityProfile(c(XITS$trimFs_paired[5],XITS$trimRs_paired[5]))
plotQualityProfile(c(X16S$trimFs_paired[5],X16S$trimRs_paired[5]))
This is a last filtering step, not very necessary anymore after Trimmomatic, but we go for it anwyay. The most important additional trimming happens on the total expected errors per read. With a maximum of 0.5 expected errors per read, we should be on the safe side. However, in my experience you can go down much lower (like < 0.05 easily) if your trimmomatic is more stringent. However, you will not be able to merge a lot of reads in C5T and U3T for the ITS.
Notice we are only working with the reads that came out of Trimmomatic as intact pairs. The reads that got orphaned, are not taken along the rest of the analysis.
XITS$filtFs <- file.path(XITS$path.cut, "filtered", basename(XITS$cutFs))
XITS$filtRs <- file.path(XITS$path.cut, "filtered", basename(XITS$cutRs))
XITS$out <- filterAndTrim(XITS$trimFs_paired, XITS$filtFs, XITS$trimRs_paired, XITS$filtRs,
maxN = 0,
maxEE = c(0.5,0.5),
truncQ = 2,
minLen = 100,
rm.phix = TRUE,
matchIDs=T,
compress = TRUE,
multithread = 12)
# Same for 16S
X16S$filtFs <- file.path(X16S$path.cut, "filtered", basename(X16S$cutFs))
X16S$filtRs <- file.path(X16S$path.cut, "filtered", basename(X16S$cutRs))
X16S$out <- filterAndTrim(X16S$trimFs_paired, X16S$filtFs, X16S$trimRs_paired, X16S$filtRs,
maxN = 0,
maxEE = c(0.5,0.5),
truncQ = 2,
minLen = 100,
rm.phix = TRUE,
matchIDs=T,
compress = TRUE,
multithread = 12)
Detailed info about these functions can be found in the dada2 documentation. We start with learning the error profiles. nbases is set high enough so the entire dataset should be used for training.
XITS$errF <- learnErrors(XITS$filtFs, multithread = 12, nbases= 1e+10)
XITS$errR <- learnErrors(XITS$filtRs, multithread = 12, nbases= 1e+10)
X16S$errR <- learnErrors(X16S$filtFs, multithread = 12, nbases= 1e+10)
X16S$errR <- learnErrors(X16S$filtRs, multithread = 12, nbases= 1e+10)
We can plot the observed error profile for each transition/transversion and compare to theoretical values. I used these plots a lot for the optimization of the filtering parameters.
plotErrors(XITS$errF, nominalQ=TRUE)
plotErrors(XITS$errR, nominalQ=TRUE)
plotErrors(X16S$errR, nominalQ=TRUE)
plotErrors(X16S$errR, nominalQ=TRUE)
For 16S they don’t look super good. It does not seem to be a major concern (see here). We may want to look if additional trimming or filtering improves the error profiles, but I currently don’t consider it a priority.
Next we move on with dereplication, and the real dada function:
XITS$derepFs <- derepFastq(XITS$filtFs, verbose=TRUE)
XITS$derepRs <- derepFastq(XITS$filtRs, verbose=TRUE)
# Name the derep-class objects by the sample names
names(XITS$derepFs) <- XITS$sample.names
names(XITS$derepRs) <- XITS$sample.names
X16S$derepFs <- derepFastq(X16S$filtFs, verbose=TRUE)
X16S$derepRs <- derepFastq(X16S$filtRs, verbose=TRUE)
names(X16S$derepFs) <- X16S$sample.names
names(X16S$derepRs) <- X16S$sample.names
XITS$dadaFs <- dada(XITS$derepFs, err=XITS$errF, multithread=12)
XITS$dadaRs <- dada(XITS$derepRs, err=XITS$errR, multithread=12)
X16S$dadaFs <- dada(X16S$derepFs, err=X16S$errR, multithread=12)
X16S$dadaRs <- dada(X16S$derepRs, err=X16S$errR, multithread=12)
XITS$mergers <- mergePairs(XITS$dadaFs, XITS$derepFs, XITS$dadaRs, XITS$derepRs, verbose=TRUE, maxMismatch=1)
XITS$seqtab <- makeSequenceTable(XITS$mergers)
table(nchar(getSequences(XITS$seqtab)))
plot(table(nchar(getSequences(XITS$seqtab))))
XITS$seqtab.nochim <- removeBimeraDenovo(XITS$seqtab, method="consensus", multithread=16, verbose=TRUE)
X16S$mergers <- mergePairs(X16S$dadaFs, X16S$derepFs, X16S$dadaRs, X16S$derepRs, verbose=TRUE, maxMismatch=1)
X16S$seqtab <- makeSequenceTable(X16S$mergers)
table(nchar(getSequences(X16S$seqtab)))
plot(table(nchar(getSequences(X16S$seqtab))))
X16S$seqtab.nochim <- removeBimeraDenovo(X16S$seqtab, method="consensus", multithread=16, verbose=TRUE)
A bimera is a two-parent chimera, in which the left side is made up of one parent sequence, and the right-side made up of a second parent sequence.
For ITS, we identified 142 bimeras out of 4954 input sequences. Nothing to worry about. For 16S, we identified 1987 bimeras out of 11820 input sequences, for 16S. That’s almost 17%, which is a lot. Not sure if we can do a lot about it. It’s something that typically happens during the PCR phase. What would worry me more is that they are perhaps false positive bimera’s. Anyhow, for now they are just removed for the remainder of the analysis, which is certainly the safest option.
Now reads have been filtered, errors have been learned, doubles have been dereplicated, dada has been executed, reads have been merged and chimeras have been removed. Let’s summarise this in a table.
XITS$track <- cbind(XITS$out, sapply(XITS$dadaFs, FUNS$getN), sapply(XITS$dadaRs, FUNS$getN), sapply(XITS$mergers, FUNS$getN), rowSums(XITS$seqtab.nochim))
X16S$track <- cbind(X16S$out, sapply(X16S$dadaFs, FUNS$getN), sapply(X16S$dadaRs, FUNS$getN), sapply(X16S$mergers, FUNS$getN), rowSums(X16S$seqtab.nochim))
# If processing a single sample, remove the sapply calls: e.g. replace sapply(dadaFs, FUNS$getN) with FUNS$getN(dadaFs)
colnames(XITS$track) <- c("input", "filtered", "denoisedF", "denoisedR", "merged", "nonchim")
colnames(X16S$track) <- c("input", "filtered", "denoisedF", "denoisedR", "merged", "nonchim")
rownames(XITS$track) <- XITS$sample.names
rownames(X16S$track) <- X16S$sample.names
| input | filtered | denoisedF | denoisedR | merged | nonchim | |
|---|---|---|---|---|---|---|
| A1T | 70386 | 70266 | 69452 | 68912 | 67097 | 63584 |
| A2T | 68132 | 68007 | 67496 | 67035 | 63102 | 62928 |
| A2T1 | 73934 | 73791 | 73668 | 73544 | 71463 | 71122 |
| A3T | 72770 | 72630 | 71946 | 71185 | 68587 | 68362 |
| A3T1 | 61226 | 61112 | 60182 | 59156 | 55828 | 55528 |
| A4T | 72194 | 72074 | 71949 | 71832 | 71018 | 71018 |
| A4T1 | 85347 | 85192 | 85052 | 85121 | 84877 | 84799 |
| A6T | 71979 | 71855 | 71767 | 71714 | 71222 | 71222 |
| A6T1 | 86518 | 86349 | 86276 | 86079 | 82296 | 82272 |
| A7T | 78905 | 78756 | 77638 | 76699 | 73327 | 72872 |
| A8T | 71930 | 71730 | 71113 | 70852 | 67771 | 67438 |
| A9T | 72184 | 72073 | 71421 | 71182 | 69944 | 69824 |
| C1T | 59307 | 59174 | 58745 | 58253 | 57473 | 57273 |
| C2T | 49926 | 49832 | 49326 | 48675 | 47362 | 47362 |
| C3T | 72283 | 72125 | 71621 | 70740 | 69682 | 69664 |
| C5T | 58387 | 58066 | 57728 | 57554 | 57086 | 57077 |
| C6T | 59801 | 59697 | 59235 | 58777 | 57292 | 57249 |
| F1T | 68057 | 67931 | 67874 | 67752 | 65447 | 65220 |
| F3T | 66860 | 66721 | 66469 | 66008 | 65045 | 65022 |
| FU1T | 78449 | 78331 | 77705 | 77199 | 74796 | 74264 |
| FU2T | 61114 | 60997 | 60372 | 59555 | 57030 | 57023 |
| FU3T | 71972 | 71848 | 71141 | 70313 | 63472 | 63472 |
| FU3T1 | 69865 | 69712 | 68952 | 68013 | 58476 | 58281 |
| HT | 64761 | 64632 | 64462 | 64355 | 63734 | 63725 |
| mockITS | 75521 | 75385 | 74118 | 75289 | 72236 | 67997 |
| mockITS1 | 72539 | 72415 | 72249 | 72365 | 70824 | 66778 |
| O10T | 81955 | 81790 | 81205 | 80688 | 78492 | 78211 |
| O11T | 45587 | 45487 | 45045 | 44566 | 43558 | 43533 |
| O12T | 76520 | 76304 | 75707 | 75026 | 73465 | 73401 |
| O2T | 80001 | 79800 | 79079 | 78480 | 76224 | 76224 |
| O3T | 81145 | 80955 | 79963 | 79149 | 75292 | 75108 |
| O4T | 70441 | 70258 | 69832 | 69376 | 68470 | 68455 |
| O5T | 75464 | 75286 | 74609 | 73869 | 70933 | 70737 |
| O6T | 71320 | 71174 | 70793 | 70299 | 68382 | 68317 |
| O8T | 73888 | 73683 | 72160 | 70893 | 66184 | 65720 |
| O9T | 64053 | 63924 | 63357 | 62422 | 60155 | 59923 |
| O9T1 | 79935 | 79775 | 79546 | 78990 | 76642 | 76642 |
| T10T | 47337 | 47217 | 46742 | 46071 | 44448 | 44448 |
| T10T1 | 84002 | 83822 | 83568 | 83065 | 80652 | 80583 |
| T11T | 67118 | 66983 | 66624 | 66030 | 63206 | 63206 |
| T12T | 78917 | 78778 | 78440 | 77814 | 74555 | 74555 |
| T12T1 | 84568 | 84376 | 84200 | 83901 | 80964 | 80964 |
| T5T | 73430 | 73269 | 73086 | 72866 | 71579 | 70170 |
| T9T | 76561 | 76388 | 76319 | 76258 | 75535 | 75531 |
| T9T1 | 84016 | 83742 | 83596 | 83384 | 78524 | 78406 |
| U3T | 50804 | 50605 | 50185 | 49728 | 48543 | 48543 |
| input | filtered | denoisedF | denoisedR | merged | nonchim | |
|---|---|---|---|---|---|---|
| A1T | 58298 | 58088 | 50174 | 54312 | 26564 | 25958 |
| A2T | 59045 | 58852 | 51300 | 56418 | 27990 | 27025 |
| A2T1 | 77702 | 77461 | 75414 | 75608 | 58887 | 57620 |
| A3T | 59004 | 58822 | 51207 | 55144 | 25989 | 25467 |
| A3T1 | 59571 | 59368 | 50065 | 55048 | 23452 | 22995 |
| A4T | 43614 | 43430 | 41868 | 41885 | 30600 | 29690 |
| A4T1 | 57779 | 57598 | 55834 | 55053 | 44921 | 44571 |
| A5T | 75612 | 75312 | 69389 | 68726 | 52896 | 42132 |
| A6T | 89876 | 89614 | 86524 | 86897 | 67940 | 67259 |
| A6T1 | 43847 | 43721 | 41482 | 41083 | 26800 | 26678 |
| A7T | 56757 | 56580 | 48268 | 52726 | 24378 | 23725 |
| A8T | 60761 | 60592 | 53770 | 57797 | 33529 | 32979 |
| A9T | 62742 | 62573 | 54932 | 59880 | 33967 | 32687 |
| C1T | 59037 | 58856 | 49470 | 54285 | 22477 | 21698 |
| C2T | 47876 | 47709 | 39898 | 43217 | 19114 | 17974 |
| C3T | 55986 | 55825 | 46618 | 51598 | 22893 | 22073 |
| C4T | 55637 | 55369 | 47507 | 52512 | 17220 | 17012 |
| C5T | 51019 | 50832 | 43880 | 46847 | 24104 | 23166 |
| C6T | 55322 | 55130 | 46604 | 51304 | 22733 | 21824 |
| F1T | 79476 | 79198 | 76276 | 77110 | 55924 | 54773 |
| F3T | 57117 | 56920 | 51240 | 52693 | 28579 | 28209 |
| FU1T | 53282 | 53093 | 45593 | 50117 | 24051 | 23543 |
| FU2T | 51260 | 51130 | 42048 | 46627 | 17654 | 17481 |
| FU3T | 52126 | 51945 | 44034 | 47499 | 21735 | 21389 |
| FU3T1 | 59302 | 59134 | 49675 | 54582 | 24739 | 24426 |
| HT | 49285 | 49127 | 42564 | 44980 | 24441 | 24161 |
| mock16S | 48115 | 47954 | 40647 | 42134 | 30603 | 19789 |
| mock16S1 | 41017 | 40899 | 34302 | 36737 | 25857 | 16894 |
| O10T | 57144 | 56959 | 49240 | 53676 | 28606 | 26030 |
| O11T | 56200 | 56040 | 49131 | 52007 | 26907 | 25675 |
| O12T | 52273 | 52100 | 43421 | 47156 | 20621 | 20139 |
| O1T | 53105 | 52915 | 42512 | 45399 | 17293 | 16958 |
| O2T | 40212 | 40013 | 31356 | 34343 | 10743 | 10684 |
| O3T | 57702 | 57484 | 47716 | 50322 | 18462 | 18249 |
| O4T | 60708 | 60516 | 54144 | 57072 | 30096 | 29179 |
| O5T | 61312 | 61116 | 52113 | 56020 | 27024 | 26500 |
| O6T | 47732 | 47578 | 40517 | 42785 | 20708 | 20440 |
| O7T | 76867 | 76642 | 74805 | 74893 | 60205 | 59498 |
| O8T | 54225 | 54066 | 43183 | 47091 | 17032 | 16773 |
| O9T | 50287 | 50101 | 43630 | 45059 | 20049 | 19831 |
| O9T1 | 51039 | 50853 | 47589 | 47010 | 24532 | 24275 |
| T10T | 57982 | 57827 | 50041 | 54640 | 26778 | 26176 |
| T10T1 | 63223 | 63048 | 58180 | 59669 | 35932 | 35390 |
| T11T | 46092 | 45960 | 39493 | 42511 | 15092 | 14541 |
| T11T1 | 52358 | 52170 | 42624 | 45825 | 17902 | 17707 |
| T12T | 51667 | 51514 | 46382 | 47994 | 25378 | 24604 |
| T12T1 | 79278 | 79028 | 76560 | 76755 | 47963 | 46678 |
| T1T | 72564 | 72347 | 69379 | 70421 | 56332 | 55152 |
| T3T | 86046 | 85830 | 64480 | 63764 | 50385 | 50312 |
| T5T | 59064 | 58880 | 50515 | 54745 | 26744 | 26028 |
| T9T | 54808 | 54582 | 52519 | 52930 | 37135 | 36792 |
| T9T1 | 53110 | 52936 | 50312 | 49845 | 32607 | 32058 |
| U3T | 49627 | 49455 | 41747 | 45740 | 18741 | 18603 |
Notice that this workflow has been optimized to retain as many read pairs as possible. This has been a deliberate choice in order to prevent biases. However, we may want to repeat some things with more stringent criteria. This depends on whether we decide to keep all samples on board.
We have observed that our mock communities dominate the first dimension of MDS (not shown), which is expected. First, we take a look at sequences that are unique to our mock communities:
# Subset seqtab.nochim to only contain the mocks
XITS$mocks <- XITS$seqtab.nochim[grepl("mock",rownames(XITS$seqtab.nochim)),]
X16S$mocks <- X16S$seqtab.nochim[grepl("mock",rownames(X16S$seqtab.nochim)),]
# Check which sequences get at least one count in either mock 1 or mock 2:
unname(which(XITS$mocks[1,] > 0 | XITS$mocks[2,] > 0))
unname(which(X16S$mocks[1,] > 0 | X16S$mocks[2,] > 0))
# Retrieve the sequences that correspond to these ASV's
XITS$mock_sequences <- colnames(XITS$mocks)[unname(which(XITS$mocks[1,] > 0 | XITS$mocks[2,] > 0))]
X16S$mock_sequences <- colnames(X16S$mocks)[unname(which(X16S$mocks[1,] > 0 | X16S$mocks[2,] > 0))]
# You can use these to check whether you find all taxons that you expect in your mocks, and whether you find something that you did not intend to include.
# Next, you can check how many times you find sequences from your mock in other samples:
XITS$mocks_table <- XITS$seqtab.nochim[,unname(which(XITS$mocks[1,] > 0 | XITS$mocks[2,] > 0))]
colnames(XITS$mocks_table) <- paste ("sequence",1:ncol(XITS$mocks_table))
X16S$mocks_table <- X16S$seqtab.nochim[,unname(which(X16S$mocks[1,] > 0 | X16S$mocks[2,] > 0))]
colnames(X16S$mocks_table) <- paste ("sequence",1:ncol(X16S$mocks_table))
Check the taxonomy of the mock samples. Notice that this is not the most efficient way of doing this, given that it takes a long time to load the references in RAM, and we will repeat this afterwards for the other samples.
X16S$mocks_taxa <- assignTaxonomy(X16S$mock_sequences,"/home/rik/fukushima/metabarcoding/silva_nr99_v138.1_wSpecies_train_set.fa.gz",verbose=T,multithread=10)
XITS$mocks_taxa <- assignTaxonomy(XITS$mock_sequences,"/home/rik/fukushima/metabarcoding/sh_general_release_dynamic_s_all_04.04.2024.fasta",verbose=T,multithread=10)
To be checked: for ITS, mock sequences 1 and sequence 6 occur in non-mock samples.
| sequence 1 | sequence 2 | sequence 3 | sequence 4 | sequence 5 | sequence 6 | sequence 7 | sequence 8 | sequence 9 | sequence 10 | sequence 11 | sequence 12 | sequence 13 | sequence 14 | sequence 15 | sequence 16 | sequence 17 | sequence 18 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| A1T | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| A2T | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| A2T1 | 571 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| A3T | 0 | 0 | 0 | 0 | 0 | 23 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| A3T1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| A4T | 0 | 0 | 0 | 0 | 0 | 325 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| A4T1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| A6T | 1790 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| A6T1 | 1362 | 0 | 0 | 0 | 0 | 14 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| A7T | 0 | 0 | 0 | 0 | 0 | 210 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| A8T | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| A9T | 0 | 0 | 0 | 0 | 0 | 13 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| C1T | 0 | 0 | 0 | 0 | 0 | 7 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| C2T | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| C3T | 0 | 0 | 0 | 0 | 0 | 25 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| C5T | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| C6T | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| F1T | 7579 | 0 | 0 | 0 | 0 | 150 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| F3T | 15 | 0 | 0 | 0 | 0 | 21 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| FU1T | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| FU2T | 0 | 0 | 0 | 0 | 0 | 8 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| FU3T | 0 | 0 | 0 | 0 | 0 | 12 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| FU3T1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| HT | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| mockITS | 8684 | 12946 | 9141 | 6795 | 5630 | 5307 | 5402 | 5717 | 2545 | 2128 | 1603 | 1691 | 0 | 212 | 0 | 113 | 45 | 38 |
| mockITS1 | 8263 | 13451 | 8937 | 6899 | 5516 | 4527 | 5622 | 4607 | 2574 | 1996 | 1614 | 1506 | 858 | 216 | 125 | 0 | 33 | 34 |
| O10T | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| O11T | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| O12T | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| O2T | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| O3T | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| O4T | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| O5T | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| O6T | 0 | 0 | 0 | 0 | 0 | 30 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| O8T | 0 | 0 | 0 | 0 | 0 | 69 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| O9T | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| O9T1 | 0 | 0 | 0 | 0 | 0 | 126 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| T10T | 2889 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| T10T1 | 14482 | 0 | 0 | 0 | 0 | 149 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| T11T | 1906 | 0 | 0 | 0 | 0 | 33 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| T12T | 8 | 0 | 0 | 0 | 0 | 23 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| T12T1 | 0 | 0 | 0 | 0 | 0 | 44 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| T5T | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| T9T | 704 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| T9T1 | 3001 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| U3T | 0 | 0 | 0 | 0 | 0 | 14 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Kingdom | Phylum | Class | Order | Family | Genus | Species | |
|---|---|---|---|---|---|---|---|
| CGCACCTTGCGCTCCTCGGTGTTCCGAGGAGCATGCCTGTTTGAGTGTCAGTAAATTCTCAACCCCTCTCGATTTGCTTCGAGCGGGTGCTTGGATGTTGGGGGCTGCCGGAGACACTGGATTCGTCCAGGACTCGGGCTCCTCTTAAATGAATCGGCTCGCGGTCGACTTTCGACTTTGCATGACAAGGCCTTTGGCGTGATAATGATCGCCGTTCGCCGAAGTGCAAGACAGAATGGTCCCGTGCCTCTAATGCGTCGACGCCTCTAGGGGCGTCTTCCTCATTGACGTTTGACCTCAAATCAGGTAGGACTACCCGCTGAACTTAA | k__Fungi | p__Basidiomycota | c__Agaricomycetes | o__Boletales | f__Suillaceae | g__Suillus | NA |
| CGCACATTGCGCCCCTTGGTATTCCAGGGGGCATGCCTGTTTGAGCGTCATTTCCTTCTCAAACATTCTGTTTGGTAGTGAGTGATACTCTTTGGAGTTAACTTGAAATTGCTGGCCTTTTCATTGGATGTTTTTTTTTTCCAAAGAGAGGTTTCTCTGCGTGCTTGAGGTATAATGCAAGTACGGTCGTTTTAGGTTTTACCAACTGCGGCTAATCTTTTTTATACTGAGCGTATTGGAACGTTATCGATAAGAAGAGAGCGTCTAGGCGAACAATGTTCTTAAAGTTTGACCTCAAATCAGGTAGGAGTACCCGCTGAACTTAA | k__Fungi | p__Ascomycota | c__Saccharomycetes | o__Saccharomycetales | f__Saccharomycetaceae | g__Saccharomyces | s__bayanus |
| CGCACATTGCGCCCCTTGGTATTCCATGGGGCATGCCTGTTCGAGCGTCATTTGTACCTTCAAGCTATGCTTGGTGTTGGGTGTTTGTCTCGCCTTTTGCGTGTAGACTCGCCTTAAAACAATTGGCAGCCGGCGTATTGATTTCGGAGCGCAGTACATCTCGCGCTTTGCACTCATAACGACGACATCCAAAAGTCATTTTTTACACTCTTGACCTCGGATCAGGTAGGGATACCCGCTGAACTTAA | k__Fungi | p__Ascomycota | NA | NA | NA | NA | NA |
| CGCACATTGCGCCCCCTGGTATTCCGGGGGGCATGCCTGTCCGAGCGTCATTGCTGCCCTCAAGCACGGCTTGTGTGTTGGGCCTCCGTCCCCCACCCGGGGGACGGGCCCGAAAGGCAGCGGCGGCACCGTGTCCGGTCCTCGAGCGTATGGGGCTTTGTCACCCGCTCTGTAGGCCCGGCCGGCGCCTGCCGACCCTCATCATCCTTTTTTCAGGTTGACCTCGGATCAGGTAGGGATACCCGCTGAACTTAA | k__Fungi | p__Ascomycota | c__Eurotiomycetes | o__Eurotiales | f__Aspergillaceae | g__Penicillium | s__adametzii |
| CGCACCTTGCGCTCCTTGGTATTCCGAGGAGCATGCCTGTTTGAGTGTCATGAAATCTTCAACTTACAGACCTTTGCGGGTTTGTAGGCTTGGACTTGGAGGCTTGTCGGCCGTGTTTCGGCCGGCTCCTCTTAAATGCATTAGCTTGATTCCTTGCGGATCGGCTCTCGGTGTGATAATGTCTACGCCGCGACCGTGAAGCGTTTTGGCGAGCTTCTAACCGTCCTCGTTTGGACAGCTTTATGACCTCTGACCTCAAATCAGGTAGGACTACCCGCTGAACTTAA | k__Fungi | p__Basidiomycota | c__Agaricomycetes | o__Polyporales | f__Ganodermataceae | g__Ganoderma | NA |
| CGCACATTGCGCCCCCTGGTATTCCGGGGGGCATGCCTGTTCGAGCGTCATTTCACCACTCAAGCCTCGCTTGGTATTGGGCAACGCGGTCCGCCGCGTGCCTCAAATCGACCGGCTGGGTCTTCTGTCCCCTAAGCGTTGTGGAAACTATTCGCTAAAGGGTGTTCGGGAGGCTACGCCGTAAAACAACCCCATTTCTAAGGTTGACCTCGGATCAGGTAGGGATACCCGCTGAACTTAA | k__Fungi | p__Ascomycota | c__Dothideomycetes | o__Cladosporiales | f__Cladosporiaceae | g__Cladosporium | s__herbarum |
| CGCACCTTGCGCTCCTTGGTATTCCGAGGAGCATGCCTGTTTGAGTGTCATGAAATCTTCAACTTACAGACCTTTGCAGGTTTGTAGGCTTGGACTTGGAGGCTTGTCGGCCGTGTTTCGGCCGGCTCCTCTTAAATGCATTAGCTTGATTCCTTGCGGATCGGCTCTCGGTGTGATAATGTCTACGCCGCGACCGTGAAGCGTTTTGGCGAGCTTCTAACCGTCCTCGTTTGGACAGCTTTATGACCTCTGACCTCAAATCAGGTAGGACTACCCGCTGAACTTAA | k__Fungi | p__Basidiomycota | c__Agaricomycetes | o__Polyporales | f__Ganodermataceae | g__Ganoderma | NA |
| CGCACATTGCGCCCTTTGGTATTCCAAAGGGCATGCCTGTTCGAGCGTCATTTGTACCCTCAAGCTTTGCTTGGTGTTGGGCGTCTTGTCTCTAGCTTTGCTGGAGACTCGCCTTAAAGTAATTGGCAGCCGGCCTACTGGTTTCGGAGCGCAGCACAAGTCGCACTCTCTATCAGCAAAGGTCTAGCATCCATTAAGCCTTTTTTTCAACTTTTGACCTCGGATCAGGTAGGGATACCCGCTGAACTTAA | k__Fungi | p__Ascomycota | c__Dothideomycetes | o__Pleosporales | f__Pleosporaceae | g__Alternaria | NA |
| CGCAACTTGCGCCCTTTGGTATTCCGAAGGGCATGCCTGTTTGAGAGTCATGAAAATCTCAATCCCTCGGGTTTTATTACCTGTTGGACTTGGATTTGGGTGTTTGCCGCGACCTGCAAAGGACGTCGGCTCGCCTTAAATGTGTTAGTGGGAAGGTGATTACCTGTCAGCCCGGCGTAATAAGTTTCGCTGGGCCTATGGGGTAGTCTTCGGCTTGCTGATAACAACCATCTCTTTTTGTTTGACCTCAAATCAGGTAGGGCTACCCGCTGAACTTAA | k__Fungi | p__Basidiomycota | c__Tremellomycetes | o__Tremellales | f__Cryptococcaceae | g__Cryptococcus | s__neoformans |
| CGCACATTGCGCCCTCTGGTATTCCGGAGGGCATGCCTGTCCGAGCGTCATTGCTGCCCTCAAGCCCGGCTTGTGTGTTGGGCCCCGTCCCCCCCGCCGGGGGGACGGGCCCGAAAGGCAGCGGCGGCACCGCGTCCGGTCCTCGAGCGTATGGGGCTTCGTCACCCGCTCTAGTAGGCCCGGCCGGCGCCAGCCGACCCCCAACCTTTAATTATCTCAGGTTGACCTCGGATCAGGTAGGGATACCCGCTGAACTTAA | k__Fungi | p__Basidiomycota | c__Agaricomycetes | o__Agaricales | f__Amanitaceae | g__Amanita | s__caesareoides |
| CGCACATGGCGCCTTCCAGTATCCTGGGAGGCATGCCTGTCCGAGCGTCGTTTCAACCCTCGAGCCCCCGTGGCCCGGCGTTGGGGATCTGCCCAGGCAGTCCCCGAAAACCAGTGGCGGACCCGTTACAGGCCCTTCCCTTGCGTAGTAGCATTCGCCTCGCATCGGGAGCCGGCGGGCTTCCAGCCTCTAAACCCCCATCAAGTCCGCCCCGGCGGCACCAAGGTTGACCTCGGATCAGGTAGGAATACCCGCTGAACTTAA | k__Fungi | p__Ascomycota | c__Sordariomycetes | o__Glomerellales | f__Plectosphaerellaceae | g__Gibellulopsis | NA |
| CGCACATTGCGCCCCTTGGTATTCCCTGGGGCATGCCTGTTCGAGCGTCATTTGTACCTTCAAGCTATGCTTGGTGTTGGGTGTTTGTCTCGCCTTTTGCGTGTAGACTCGCCTTAAAACAATTGGCAGCCGGCGTATTGATTTCGGAGCGCAGTACATCTCGCGCTTTGCACTCATAACGACGACATCCAAAAGTCATTTTTTACACTCTTGACCTCGGATCAGGTAGGGATACCCGCTGAACTTAA | k__Fungi | p__Ascomycota | c__Dothideomycetes | o__Pleosporales | f__Didymellaceae | NA | NA |
| CGCACATTGCGCCCTTTGGTATTCCCAAGGGCATGCCTGTTCGAGCGTCATTTGTACCCTCAAGCTTTGCTTGGTGTTGGGCGTCTTGTCTCTAGCTTTGCTGGAGACTCGCCTTAAAGTAATTGGCAGCCGGCCTACTGGTTTCGGAGCGCAGCACAAGTCGCACTCTCTATCAGCAAAGGTCTAGCATCCATTAAGCCTTTTTTTCAACTTTTGACCTCGGATCAGGTAGGGATACCCGCTGAACTTAA | k__Fungi | NA | NA | NA | NA | NA | NA |
| CGCACATGGCGCCTTCCCGTATCCTGGGAGGCATGCCTGTCCGAGCGTCGTTTCAACCCTCGAGCCCCCGTGGCCCGGCGTTGGGGATCTGCCCAGGCAGTCCCCGAAAACCAGTGGCGGACCCGTTACAGGCCCTTCCCTTGCGTAGTAGCATTCGCCTCGCATCGGGAGCCGGCGGGCTTCCAGCCTCTAAACCCCCATCAAGTCCGCCCCGGCGGCACCAAGGTTGACCTCGGATCAGGTAGGAATACCCGCTGAACTTAA | k__Fungi | p__Ascomycota | c__Sordariomycetes | o__Glomerellales | f__Plectosphaerellaceae | g__Gibellulopsis | NA |
| TTTTTTTTTTAATGATACGGCGACCACCGAGATCTACACTTAAGTTGTGTCGTCGGCAGCGTCAGATGTGTATAAGAGACAGGTGAATCATCGAATAAGCGGAGGACTGTCTCTTATACACATCTCCGAGCCCACGAGACCACTCAATTCATCTCGTATGCCGTCTTCTGCTTGAAAAAAAAAA | NA | NA | NA | NA | NA | NA | NA |
| TTTTTTTTTTAATGATACGGCGACCACCGAGATCTACACCCTGATACAATCGTCGGCAGCGTCAGATGTGTATAAGAGACAGGTGAATCATCGAATAAGCGGAGGACTGTCTCTTATACACATCTCCGAGCCCACGAGACTGCCGGTCAGATCTCGTATGCCGTCTTCTGCTTGAAAAAAAAAA | NA | NA | NA | NA | NA | NA | NA |
| CGCACATTGCGCCCGGTGGTATTCCGCCGGGCATGCCTGTTCGAGCGTCATTATAACCACTCAAGCCTGGCTTGGTATTGGGGTACGCGGTCCCGCGGCTCCTAAAATCAGTGGCGGTGCCGGTGGGCTCTAAGCGTAGTACATCTCCTCGCTATAGGGTCCCTCTGGTTGCTTGCCAAAACCCCCCATTTTTACAGGTTGACCTCGGATCAGGTAGGGATACCCGCTGAACTTAA | NA | NA | NA | NA | NA | NA | NA |
| CGCACATTGCGCCCGCTGGTATTCCGGCGGGCATGCCTGTTCGAGCGTCATTTCAACCCTCAAGCCCTCGGGTTTGGTGTTGGGGATCGGCTCTGCCTTCTGGCGGTGCCGCCCCCGAAATACATTGGCGGTCTCGCTGCAGCCTCCATTGCGTAGTAGCTAACACCTCGCAACTGGAACGCGGCGCGGCCATGCCGTAAAACCCCAACTTCTGAATGTTGACCTCGGATCAGGTAGGAATACCCGCTGAACTTAA | k__Fungi | p__Ascomycota | c__Sordariomycetes | o__Hypocreales | f__Nectriaceae | g__Fusarium | s__tricinctum |
| sequence 1 | sequence 2 | sequence 3 | sequence 4 | sequence 5 | sequence 6 | sequence 7 | sequence 8 | sequence 9 | sequence 10 | sequence 11 | sequence 12 | sequence 13 | sequence 14 | sequence 15 | sequence 16 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| A1T | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| A2T | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| A2T1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| A3T | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| A3T1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| A4T | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| A4T1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| A5T | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| A6T | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| A6T1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| A7T | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| A8T | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| A9T | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| C1T | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| C2T | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| C3T | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| C4T | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| C5T | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| C6T | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| F1T | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| F3T | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| FU1T | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| FU2T | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| FU3T | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| FU3T1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| HT | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| mock16S | 3409 | 3034 | 3050 | 2741 | 1935 | 1432 | 649 | 586 | 585 | 556 | 545 | 477 | 434 | 313 | 41 | 2 |
| mock16S1 | 2896 | 2586 | 2563 | 2443 | 1381 | 1264 | 648 | 497 | 494 | 481 | 443 | 471 | 414 | 313 | 0 | 0 |
| O10T | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| O11T | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| O12T | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| O1T | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| O2T | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| O3T | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| O4T | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| O5T | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| O6T | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| O7T | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| O8T | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| O9T | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| O9T1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| T10T | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| T10T1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| T11T | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| T11T1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| T12T | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| T12T1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| T1T | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| T3T | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| T5T | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| T9T | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| T9T1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| U3T | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
Don’t forget to scroll to the right!
| Kingdom | Phylum | Class | Order | Family | Genus | Species | |
|---|---|---|---|---|---|---|---|
| TGGGGAATATTGCACAATGGGCGCAAGCCTGATGCAGCCATGCCGCGTGTATGAAGAAGGCCTTCGGGTTGTAAAGTACTTTCAGCGGGGAGGAAGGTGTTGTGGTTAATAACCGCAGCAATTGACGTTACCCGCAGAAGAAGCACCGGCTAACTCCGTGCCAGCAGCCGCGGTAATACGGAGGGTGCAAGCGTTAATCGGAATTACTGGGCGTAAAGCGCACGCAGGCGGTCTGTCAAGTCGGATGTGAAATCCCCGGGCTCAACCTGGGAACTGCATTCGAAACTGGCAGGCTTGAGTCTTGTAGAGGGGGGTAGAATTCCAGGTGTAGCGGTGAAATGCGTAGAGATCTGGAGGAATACCGGTGGCGAAGGCGGCCCCCTGGACAAAGACTGACGCTCAGGTGCGAAAGCGTGGGGAGCAAACA | Bacteria | Proteobacteria | Gammaproteobacteria | Enterobacterales | Enterobacteriaceae | Salmonella | enterica |
| TGGGGAATATTGCACAATGGGCGCAAGCCTGATGCAGCCATGCCGCGTGTATGAAGAAGGCCTTCGGGTTGTAAAGTACTTTCAGCGGGGAGGAAGGGAGTAAAGTTAATACCTTTGCTCATTGACGTTACCCGCAGAAGAAGCACCGGCTAACTCCGTGCCAGCAGCCGCGGTAATACGGAGGGTGCAAGCGTTAATCGGAATTACTGGGCGTAAAGCGCACGCAGGCGGTTTGTTAAGTCAGATGTGAAATCCCCGGGCTCAACCTGGGAACTGCATCTGATACTGGCAAGCTTGAGTCTCGTAGAGGGGGGTAGAATTCCAGGTGTAGCGGTGAAATGCGTAGAGATCTGGAGGAATACCGGTGGCGAAGGCGGCCCCCTGGACGAAGACTGACGCTCAGGTGCGAAAGCGTGGGGAGCAAACA | Bacteria | Proteobacteria | Gammaproteobacteria | Enterobacterales | Enterobacteriaceae | Escherichia-Shigella | coli |
| TAGGGAATCTTCCGCAATGGACGAAAGTCTGACGGAGCAACGCCGCGTGAGTGATGAAGGTTTTCGGATCGTAAAGCTCTGTTGTTAGGGAAGAACAAGTACCGTTCGAATAGGGCGGTACCTTGACGGTACCTAACCAGAAAGCCACGGCTAACTACGTGCCAGCAGCCGCGGTAATACGTAGGTGGCAAGCGTTGTCCGGAATTATTGGGCGTAAAGGGCTCGCAGGCGGTTCCTTAAGTCTGATGTGAAAGCCCCCGGCTCAACCGGGGAGGGTCATTGGAAACTGGGGAACTTGAGTGCAGAAGAGGAGAGTGGAATTCCACGTGTAGCGGTGAAATGCGTAGAGATGTGGAGGAACACCAGTGGCGAAGGCGACTCTCTGGTCTGTAACTGACGCTGAGGAGCGAAAGCGTGGGGAGCGAACA | Bacteria | Firmicutes | Bacilli | Bacillales | Bacillaceae | Bacillus | halotolerans |
| TAGGGAATCTTCCACAATGGGCGCAAGCCTGATGGAGCAACACCGCGTGAGTGAAGAAGGGTTTCGGCTCGTAAAGCTCTGTTGTTAAAGAAGAACACGTATGAGAGTAACTGTTCATACGTTGACGGTATTTAACCAGAAAGTCACGGCTAACTACGTGCCAGCAGCCGCGGTAATACGTAGGTGGCAAGCGTTATCCGGATTTATTGGGCGTAAAGAGAGTGCAGGCGGTTTTCTAAGTCTGATGTGAAAGCCTTCGGCTTAACCGGAGAAGTGCATCGGAAACTGGATAACTTGAGTGCAGAAGAGGGTAGTGGAACTCCATGTGTAGCGGTGGAATGCGTAGATATATGGAAGAACACCAGTGGCGAAGGCGGCTACCTGGTCTGCAACTGACGCTGAGACTCGAAAGCATGGGTAGCGAACA | Bacteria | Firmicutes | Bacilli | Lactobacillales | Lactobacillaceae | Limosilactobacillus | NA |
| TGGGGAATATTGGACAATGGGCGAAAGCCTGATCCAGCCATGCCGCGTGTGTGAAGAAGGTCTTCGGATTGTAAAGCACTTTAAGTTGGGAGGAAGGGCAGTAAGTTAATACCTTGCTGTTTTGACGTTACCAACAGAATAAGCACCGGCTAACTTCGTGCCAGCAGCCGCGGTAATACGAAGGGTGCAAGCGTTAATCGGAATTACTGGGCGTAAAGCGCGCGTAGGTGGTTCAGCAAGTTGGATGTGAAATCCCCGGGCTCAACCTGGGAACTGCATCCAAAACTACTGAGCTAGAGTACGGTAGAGGGTGGTGGAATTTCCTGTGTAGCGGTGAAATGCGTAGATATAGGAAGGAACACCAGTGGCGAAGGCGACCACCTGGACTGATACTGACACTGAGGTGCGAAAGCGTGGGGAGCAAACA | Bacteria | Proteobacteria | Gammaproteobacteria | Pseudomonadales | Pseudomonadaceae | Pseudomonas | aeruginosa |
| TAGGGAATCTTCCGCAATGGGCGAAAGCCTGACGGAGCAACGCCGCGTGAGTGATGAAGGTCTTCGGATCGTAAAACTCTGTTATTAGGGAAGAACATATGTGTAAGTAACTGTGCACATCTTGACGGTACCTAATCAGAAAGCCACGGCTAACTACGTGCCAGCAGCCGCGGTAATACGTAGGTGGCAAGCGTTATCCGGAATTATTGGGCGTAAAGCGCGCGTAGGCGGTTTTTTAAGTCTGATGTGAAAGCCCACGGCTCAACCGTGGAGGGTCATTGGAAACTGGAAAACTTGAGTGCAGAAGAGGAAAGTGGAATTCCATGTGTAGCGGTGAAATGCGCAGAGATATGGAGGAACACCAGTGGCGAAGGCGACTTTCTGGTCTGTAACTGACGCTGATGTGCGAAAGCGTGGGGATCAAACA | Bacteria | Firmicutes | Bacilli | Staphylococcales | Staphylococcaceae | Staphylococcus | aureus |
| TAGGGAATCTTCCACAATGGGCGCAAGCCTGATGGAGCAACCCCGCGTGAGTGAAGAAGGGTTTCGGCTCGTAAAGCTCTGTTGTTAAAGAAGAACACGTATGAGAGTAACTGTTCATACGTTGACGGTATTTAACCAGAAAGTCACGGCTAACTACGTGCCAGCAGCCGCGGTAATACGTAGGTGGCAAGCGTTATCCGGATTTATTGGGCGTAAAGAGAGTGCAGGCGGTTTTCTAAGTCTGATGTGAAAGCCTTCGGCTTAACCGGAGAAGTGCATCGGAAACTGGATAACTTGAGTGCAGAAGAGGGTAGTGGAACTCCATGTGTAGCGGTGGAATGCGTAGATATATGGAAGAACACCAGTGGCGAAGGCGGCTACCTGGTCTGCAACTGACGCTGAGACTCGAAAGCATGGGTAGCGAACA | Bacteria | Firmicutes | Bacilli | Lactobacillales | Lactobacillaceae | Limosilactobacillus | NA |
| TGGGGAATATTGGACAATGGGCGAAAGCCTGATCCAGCCATGCCGCGTGTGTGAAGAAGGTCTTCGGATTGTAAAGCACTTTAAGTTGGGAGGAAGGGTTGTAGATTAATACTCTGCAATTTTGACGTTACCGACAGAATAAGCACCGGCTAACTCTGTGCCAGCAGCCGCGGTAATACAGAGGGTGCAAGCGTTAATCGGAATTACTGGGCGTAAAGCGCGCGTAGGTGGTTTGTTAAGTTGGATGTGAAATCCCCGGGCTCAACCTGGGAACTGCATTCAAAACTGACTGACTAGAGTATGGTAGAGGGTGGTGGAATTTCCTGTGTAGCGGTGAAATGCGTAGATATAGGAAGGAACACCAGTGGCGAAGGCGACCACCTGGACTAATACTGACACTGAGGTGCGAAAGCGTGGGGAGCAAACA | Bacteria | Proteobacteria | Gammaproteobacteria | Pseudomonadales | Pseudomonadaceae | Pseudomonas | azotoformans |
| TGGGGAATATTGCACAATGGGCGCAAGCCTGATGCAGCCATGCCGCGTGTATGAAGAAGGCCTTCGGGTTGTAAAGTACTTTCAGTGGGGAGGAAGGGAGTAAAGTTAATACCTTTGCTCATTGACGTTACCCGCAGAAGAAGCACCGGCTAACTCCGTGCCAGCAGCCGCGGTAATACGGAGGGTGCAAGCGTTAATCGGAATTACTGGGCGTAAAGCGCACGCAGGCGGTTTGTTAAGTCAGATGTGAAATCCCCGGGCTCAACCTGGGAACTGCATCTGATACTGGCAAGCTTGAGTCTCGTAGAGGGGGGTAGAATTCCAGGTGTAGCGGTGAAATGCGTAGAGATCTGGAGGAATACCGGTGGCGAAGGCGGCCCCCTGGACGAAGACTGACGCTCAGGTGCGAAAGCGTGGGGAGCAAACA | Bacteria | Proteobacteria | Gammaproteobacteria | Enterobacterales | Enterobacteriaceae | Escherichia-Shigella | NA |
| TGGGGAATATTGCACAATGGGCGCAAGCCTGATGCAGCCATGCCGCGTGTATGAAGAAGGCCTTCGGGTTGTAAAGTACTTTCAGCGGGGAGGAAGGTGTTGCGGTTAATAACCGCAGCAATTGACGTTACCCGCAGAAGAAGCACCGGCTAACTCCGTGCCAGCAGCCGCGGTAATACGGAGGGTGCAAGCGTTAATCGGAATTACTGGGCGTAAAGCGCACGCAGGCGGTCTGTCAAGTCGGATGTGAAATCCCCGGGCTCAACCTGGGAACTGCATTCGAAACTGGCAGGCTTGAGTCTTGTAGAGGGGGGTAGAATTCCAGGTGTAGCGGTGAAATGCGTAGAGATCTGGAGGAATACCGGTGGCGAAGGCGGCCCCCTGGACAAAGACTGACGCTCAGGTGCGAAAGCGTGGGGAGCAAACA | Bacteria | Proteobacteria | Gammaproteobacteria | Enterobacterales | Enterobacteriaceae | Salmonella | enterica |
| TAGGGAATCTTCCGCAATGGACGAAAGTCTGACGGAGCAACGCCGCGTGTATGAAGAAGGTTTTCGGATCGTAAAGTACTGTTGTTAGAGAAGAACAAGGATAAGAGTAACTGCTTGTCCCTTGACGGTATCTAACCAGAAAGCCACGGCTAACTACGTGCCAGCAGCCGCGGTAATACGTAGGTGGCAAGCGTTGTCCGGATTTATTGGGCGTAAAGCGCGCGCAGGCGGTCTTTTAAGTCTGATGTGAAAGCCCCCGGCTTAACCGGGGAGGGTCATTGGAAACTGGAAGACTGGAGTGCAGAAGAGGAGAGTGGAATTCCACGTGTAGCGGTGAAATGCGTAGATATGTGGAGGAACACCAGTGGCGAAGGCGACTCTCTGGTCTGTAACTGACGCTGAGGCGCGAAAGCGTGGGGAGCAAACA | Bacteria | Firmicutes | Bacilli | Lactobacillales | Listeriaceae | Listeria | monocytogenes |
| TAGGGAATATTCCGCAATGGACGAAAGTCTGACGGAGCAACGCCGCGTGAGTGATGAAGGTTTTCGGATCGTAAAGCTCTGTTGTTAGGGAAGAACAAGTACCGTTCGAATAGGGCGGTACCTTGACGGTACCTAACCAGAAAGCCACGGCTAACTACGTGCCAGCAGCCGCGGTAATACGTAGGTGGCAAGCGTTGTCCGGAATTATTGGGCGTAAAGGGCTCGCAGGCGGTTCCTTAAGTCTGATGTGAAAGCCCCCGGCTCAACCGGGGAGGGTCATTGGAAACTGGGGAACTTGAGTGCAGAAGAGGAGAGTGGAATTCCACGTGTAGCGGTGAAATGCGTAGAGATGTGGAGGAACACCAGTGGCGAAGGCGACTCTCTGGTCTGTAACTGACGCTGAGGAGCGAAAGCGTGGGGAGCGAACA | Bacteria | Firmicutes | Bacilli | Bacillales | Bacillaceae | Bacillus | intestinalis |
| TAGGGAATCTTCGGCAATGGACGAAAGTCTGACCGAGCAACGCCGCGTGAGTGAAGAAGGTTTTCGGATCGTAAAACTCTGTTGTTAGAGAAGAACAAGGACGTTAGTAACTGAACGTCCCCTGACGGTATCTAACCAGAAAGCCACGGCTAACTACGTGCCAGCAGCCGCGGTAATACGTAGGTGGCAAGCGTTGTCCGGATTTATTGGGCGTAAAGCGAGCGCAGGCGGTTTCTTAAGTCTGATGTGAAAGCCCCCGGCTCAACCGGGGAGGGTCATTGGAAACTGGGAGACTTGAGTGCAGAAGAGGAGAGTGGAATTCCATGTGTAGCGGTGAAATGCGTAGATATATGGAGGAACACCAGTGGCGAAGGCGGCTCTCTGGTCTGTAACTGACGCTGAGGCTCGAAAGCGTGGGGAGCAAACA | Bacteria | Firmicutes | Bacilli | Lactobacillales | Enterococcaceae | Enterococcus | faecalis |
| TAGGGAATCTTCCGCAATGGGCGAAAGCCTGACGGAGCAACGCCGCGTGAGTGATGAAGGTCTTCGGATCGTAAAGCTCTGTTATTAGGGAAGAACATATGTGTAAGTAACTGTGCACATCTTGACGGTACCTAATCAGAAAGCCACGGCTAACTACGTGCCAGCAGCCGCGGTAATACGTAGGTGGCAAGCGTTATCCGGAATTATTGGGCGTAAAGCGCGCGTAGGCGGTTTTTTAAGTCTGATGTGAAAGCCCACGGCTCAACCGTGGAGGGTCATTGGAAACTGGAAAACTTGAGTGCAGAAGAGGAAAGTGGAATTCCATGTGTAGCGGTGAAATGCGCAGAGATATGGAGGAACACCAGTGGCGAAGGCGACTTTCTGGTCTGTAACTGACGCTGATGTGCGAAAGCGTGGGGATCAAACA | Bacteria | Firmicutes | Bacilli | Staphylococcales | Staphylococcaceae | Staphylococcus | aureus |
| TGGGGAATATTGCACAATGGGCGCAAGCCTGATGCAGCCATGCCGCGTGTGTGAAGAAGGCCTTAGGGTTGTAAAGCACTTTCAGCGAGGAGGAAGGGTTCAGTGTTAATAGCACTGTGCATTGACGTTACTCGCAGAAGAAGCACCGGCTAACTCCGTGCCAGCAGCCGCGGTAATACGGAGGGTGCAAGCGTTAATCGGAATTACTGGGCGTAAAGCGCACGCAGGCGGTTTGTTAAGTCAGATGTGAAATCCCCGAGCTTAACTTGGGAACTGCATTTGAAACTGGCAAGCTAGAGTCTTGTAGAGGGGGGTAGAATTCCAGGTGTAGCGGTGAAATGCGTAGAGATCTGGAGGAATACCGGTGGCGAAGGCGGCCCCCTGGACAAAGACTGACGCTCAGGTGCGAAAGCGTGGGGAGCAAACA | Bacteria | Proteobacteria | Gammaproteobacteria | Enterobacterales | Yersiniaceae | Serratia | plymuthica |
| TGGGGAATATTGGACAATGGGCGAAAGCCTGATCCAGCCATGCCGCGTGTGTGAAGAAGGTCTTCGGATTGTAAAGCACTTTAAGTTGGGAGGAAGGGTTGTAGATTAATACTCTGCAATTTTGACGTTACCGACAGAATAAGCACCGGCTAACTCTGTGCCAGCAGCCGCGGTAATACAGAGGGTGCAAGCGTTAATCGGAATTACTGGGCGTAAAGCGCGCGTAGGTGGTTTGTTAAGTTGGATGTGAAATCCCCGGGCTCAACCTGGGAACTGCATTCGAAAACTGACTGACTAGAGTATGGTAGAGGGTGGTGGAATTTCCTGTGTAGCGGTGAAATGCGTAGATATAGGAAGGAACACCAGTGGCGAAGGCGACCACCTGGACTAATACTGACACTGAGGTGCGAAAGCGTGGGGAGCAAACA | Bacteria | Proteobacteria | Gammaproteobacteria | Pseudomonadales | Pseudomonadaceae | Pseudomonas | NA |
We have found 4812 unique sequences, which is an awful lot. Many of them look alike, which I presume is mostly due to evolutionary relationships, but which is sometimes also due presence of some unrecognized bits of primers. I will cluster sequences using 5-mer distances, and if sequences look extremely similar, I may merge some.
Furthermore, some other unexpected sequences are present, which we filter out as well.
One thing I noticed, is that on rare occasions, sequences are actually virtually the same, except for a primer part that has not been recognized because its not similar enough to the primer sequences we gave for filtering. I inferred this from 5-mer distances (see below), and just want to show one example as an illustration here:
seqs4 <- DNAStringSet(x=colnames(XITS$seqtab.nochim)[c(1,2563,2611,3097)])
aln4 <- msa(seqs4)
## use default substitution matrix
For now I don’t manage to output the alignment. Don’t bother.
#knitr::raw_latex(msaPrettyPrint(x= aln4 , output="asis"))
knitr::pandoc_to()
[1] “html”
#knitr::raw_latex(msaPrettyPrint(x= aln4 , output="asis"))
knitr::raw_latex("\\emph{some text}")
#knitr::asis_output(msaPrettyPrint(x= aln4 , output="asis"))
If we look how often these are counted, we observe that in this case, there clearly is a main variant (variant 1). Variant 2 is just a SNP away from variant 1, which may well reflect a natural situation. It occurs in only one sample (A4T) and given its close proximity to variant 1 and its low frequency compared to the main variant, it can easily be “collapsed” with variant 1.
Variants 3 and 4 also occur in small numbers in single samples, but their origin is less clear to me. If they are a sequencing artifact, it’s worrisome that they occur in 38 and 25 copies. If they would just be a natural variant of the primer binding region, there is less reason to worry and they can be collapsed with variant 1 as well. BUT THIS HAS TO BE CHECKED
unname(XITS$seqtab.nochim[,c(1,2563,2611,3097)])
| variant 1 | variant 2 | variant 3 | variant 4 | |
|---|---|---|---|---|
| A1T | 0 | 0 | 0 | 0 |
| A2T | 19460 | 0 | 0 | 0 |
| A2T1 | 24988 | 0 | 0 | 0 |
| A3T | 0 | 0 | 0 | 0 |
| A3T1 | 0 | 0 | 0 | 0 |
| A4T | 21478 | 38 | 0 | 0 |
| A4T1 | 33565 | 0 | 0 | 25 |
| A6T | 10394 | 0 | 0 | 0 |
| A6T1 | 7064 | 0 | 0 | 0 |
| A7T | 0 | 0 | 0 | 0 |
| A8T | 11989 | 0 | 0 | 0 |
| A9T | 3314 | 0 | 0 | 0 |
| C1T | 0 | 0 | 0 | 0 |
| C2T | 0 | 0 | 0 | 0 |
| C3T | 0 | 0 | 0 | 0 |
| C5T | 0 | 0 | 0 | 0 |
| C6T | 41 | 0 | 0 | 0 |
| F1T | 33 | 0 | 0 | 0 |
| F3T | 0 | 0 | 0 | 0 |
| FU1T | 0 | 0 | 0 | 0 |
| FU2T | 0 | 0 | 0 | 0 |
| FU3T | 0 | 0 | 0 | 0 |
| FU3T1 | 0 | 0 | 0 | 0 |
| HT | 0 | 0 | 0 | 0 |
| mockITS | 0 | 0 | 0 | 0 |
| mockITS1 | 0 | 0 | 0 | 0 |
| O10T | 0 | 0 | 0 | 0 |
| O11T | 0 | 0 | 0 | 0 |
| O12T | 0 | 0 | 0 | 0 |
| O2T | 0 | 0 | 0 | 0 |
| O3T | 0 | 0 | 0 | 0 |
| O4T | 0 | 0 | 0 | 0 |
| O5T | 0 | 0 | 0 | 0 |
| O6T | 0 | 0 | 0 | 0 |
| O8T | 0 | 0 | 0 | 0 |
| O9T | 2090 | 0 | 0 | 0 |
| O9T1 | 2373 | 0 | 0 | 0 |
| T10T | 0 | 0 | 0 | 0 |
| T10T1 | 0 | 0 | 0 | 0 |
| T11T | 6035 | 0 | 0 | 0 |
| T12T | 94 | 0 | 0 | 0 |
| T12T1 | 30 | 0 | 0 | 0 |
| T5T | 35329 | 0 | 37 | 0 |
| T9T | 0 | 0 | 0 | 0 |
| T9T1 | 0 | 0 | 0 | 0 |
| U3T | 0 | 0 | 0 | 0 |
Another way to look at it, is to check the last couple of nucleotides of our sequences. They should almost invariably read “CTTAA” for ITS, and “AAACA” for 16S
table(unlist(lapply(colnames(XITS$seqtab.nochim),FUNS$substrRight,5)))
##
## AAAAA AGGCT CCTAA CTTAA CTTAT GAGGA GGAGA TTAAG TTTAA
## 17 1 1 4602 1 3 2 24 161
Sequences ending in AAAAA are not fungal ITS. They are probably bacterial, and may just be carryover from other sequencing projects. Only 651 counts (17 sequences) in total. The other endings are either some unrecognized primers (GAGGA and GGAGA, 5 sequences, 218 total counts), something unknown (AGGCT) or some variations of the amplicon (TTTAA, CCTAA,TTAAG). Given that all exceptions are very small number of sequences, we will only proceed with the amplicons ending at CTTAA, TTTAA, CCTAA and TTAAG
unname(XITS$seqtab.nochim[,which(unlist(lapply(colnames(XITS$seqtab.nochim), FUNS$substrRight,5))=="AAAAA")])
XITS$X <- XITS$seqtab.nochim[,unlist(lapply(colnames(XITS$seqtab.nochim), FUNS$substrRight,5) %in% c("CTTAA", "TTTAA", "CCTAA", "TTAAG"))]
We remove the mock samples, and also we discard those ASV’s that only belong to the mock samples:
XITS$X <- XITS$X[ ! rownames(XITS$X) %in% c("mockITS","mockITS1"),]
XITS$X <- XITS$X[,colSums(XITS$X) > 0]
# same for 16S
X16S$X <- X16S$seqtab.nochim
X16S$X <- X16S$X[ ! rownames(X16S$X) %in% c("mock16S","mock16S1"),]
X16S$X <- X16S$X[,colSums(X16S$X) > 0]
NOTE TO SELF: we could (should?) have done this with CD-HIT as well. It’s probably less cumbersome.
We will now see if we can cluster variants based on k-mer distance. I do this in a two-step procedure for speed. First, we calculate the 5-mer distance between all sequences. I noticed that small differences, such as a SNP or the presence/absence of a primer was typically a 5-mer distance of > 0.001 but < 0.01. We cluster based on an initial k-mer distance of 5%. In a next step, we cluster within these clusters, based on real alignments, going to 97.5% sequence identity. This is our final cut-off for OTU clustering.
XITS$otus97.5 <- FUNS$otu_cluster(XITS$X)
X16S$otus97.5 <- FUNS$otu_cluster(X16S$X)
# Now we can generate count matrices for the otus rather than for ASVs
XITS$X_otu97.5 <- as.data.frame(XITS$X)
colnames(XITS$X_otu97.5) <- as.character(XITS$otus97.5)
XITS$X_otu97.5 <- as.matrix(as.data.frame(lapply(split.default(XITS$X_otu97.5, f = XITS$otus97.5), rowSums)))
X16S$X_otu97.5 <- as.data.frame(X16S$X)
colnames(X16S$X_otu97.5) <- as.character(X16S$otus97.5)
X16S$X_otu97.5 <- as.matrix(as.data.frame(lapply(split.default(X16S$X_otu97.5, f = X16S$otus97.5), rowSums)))
We will base our decisions about merging and removing on alpha and beta diversity measures, because they take into account to what extent samples are similar to each other. First we generate sublists within XITS and X16S that will contain alpha diversity for the clustered OTU’s. We will use these throughout the rest of the pipeline, but in this manner, we can add a very similar analysis for other OTU cutoffs if we wish to do so.
XITS$alpha_otu97.5 <- list()
X16S$alpha_otu97.5 <- list()
XITS$alpha_asv <- list()
X16S$alpha_asv <- list()
We will now look at most common alpha diversity measures, both for ITS and 16S, on the OTU level. The OTU’s are clustered at 97.5% identity, which is very stringent. I only do this to prevent sequencing errors or super rare variants from slipping in, influencing our results. In order to make sure we are not introducing too much bias, I first do two quick checks: are some diversity measures correlated with sample size (sequencing depth)? And, are some diversity measures influenced a lot by using the OTU’s rather than the ASV’s? For this quick check I use the alpha function from the microbiome package, since it unites most common alpha diversity measures.
XITS$alpha_otu97.5$alpha <- cbind.data.frame(microbiome::alpha(t(XITS$X_otu97.5)),"total"=rowSums(XITS$X_otu97.5))
X16S$alpha_otu97.5$alpha <- cbind.data.frame(microbiome::alpha(t(X16S$X_otu97.5)),"total"=rowSums(X16S$X_otu97.5))
XITS$alpha_asv$alpha <- cbind.data.frame(microbiome::alpha(t(XITS$X)),"total"=rowSums(XITS$X))
X16S$alpha_asv$alpha <- cbind.data.frame(microbiome::alpha(t(X16S$X)),"total"=rowSums(X16S$X))
Next, I visualize potential biases with scatterplots. For the correlations between sequencing depth and alpha diversity measures, I use a red colour in case the correlation is significant (p<0.05).
# Relationship between OTU's and ASV's for ITS:
par(mfrow=c(6,4))
for (i in c(1:17,19:22))
{
plot(XITS$alpha_otu97.5$alpha[,i],XITS$alpha_asv$alpha[,i],main=paste("ITS:",colnames(XITS$alpha_otu97.5$alpha)[i]),xlab="OTU",ylab="ASV")
}
# Correlation with sequencing depth for ITS:
par(mfrow=c(6,4))
for (i in c(1:17,19:22))
{
p <- cor.test(XITS$alpha_otu97.5$alpha[,i],XITS$alpha_otu97.5$alpha[,23])$p.value
r <- cor(XITS$alpha_otu97.5$alpha[,i],XITS$alpha_otu97.5$alpha[,23])
corcol <- ifelse(p<0.05,yes = "darkred", no ="black")
plot(XITS$alpha_otu97.5$alpha[,i],XITS$alpha_otu97.5$alpha[,23],main=paste("ITS:", colnames(XITS$alpha_otu97.5$alpha)[i]),xlab="alpha diversity estimate", ylab= "sequencing depth",col=corcol)
}
# Relationship between OTU's and ASV's for 16S:
par(mfrow=c(6,4))
for (i in c(1:17,19:22))
{
plot(X16S$alpha_otu97.5$alpha[,i],X16S$alpha_asv$alpha[,i],main=paste("16S:", colnames(X16S$alpha_otu97.5$alpha)[i]),xlab="OTU",ylab="ASV")
}
# Correlation with sequencing depth for 16S:
par(mfrow=c(6,4))
for (i in c(1:17,19:22))
{
p <- cor.test(X16S$alpha_otu97.5$alpha[,i],X16S$alpha_otu97.5$alpha[,23])$p.value
r <- cor(X16S$alpha_otu97.5$alpha[,i],X16S$alpha_otu97.5$alpha[,23])
corcol <- ifelse(p<0.05,yes = "darkred", no ="black")
plot(X16S$alpha_otu97.5$alpha[,i],X16S$alpha_otu97.5$alpha[,23],main=paste("16S:", colnames(X16S$alpha_otu97.5$alpha)[i]),xlab="alpha diversity estimate", ylab= "sequencing depth",col=corcol)
}
I conclude that for most indices, OTU vs ASV does not really matter. “dominance_core_abundance” cannot be calculated for ASV’s. The highest correlations are obviously in richness/diversity, while dominance and especially rarity will be impacted a bit more. The correlation is weakest for evenness Camargo.
For what’s concerned correlation between indices and sample size: there may be problem, because there are typically significant correlations, though never strong. Nasrin: you looked into saturation plots. They should be added here.
I find little evidence for outliers at the level of ITS, but for 16S there are two samples that are clearly outliers: A5T and T3T This becomes clear with a quick pairs plot presenting a sample of alpha diversity measures, where A5T is colored red, and T3T orange. A5T and T3T must be dominated by a couple of very abundant OTU’s, while otherwise being composed of many rare taxa.
We will also find out that these samples have an extreme influence on ordination of beta diversity measures, and I think it would be best to exclude them altogether, also because they don’t have particularly interesting values for 137Cs.
cols <- rep("black",51)
cols[8] <- "red" # A5T
cols[47] <- "orange" # T3T
pairs(X16S$alpha_otu97.5$alpha[c(1,4,5,6,8,9,11,14,22)],col=cols,pch=15)
We will use some simple beta diversity based distance plots to show that technical replicates are very similar:
XITS$beta_asv <- list()
X16S$beta_asv <- list()
XITS$beta_otu97.5 <- list()
X16S$beta_otu97.5 <- list()
Bray Curtis distance based on the ASVs. Notice how the pairs of technical replicates are close to each other, but not necessarily always overlapping. Also notice how C5T and C6T, outliers on the level of readthroughs, seem to behave perfectly normal here.
XITS$beta_asv$BCdist <- vegan::vegdist(XITS$X, method = "bray")
FUNS$betaplot(XITS$beta_asv$BCdist,2,"ITS: NMDS based on Bray Curtis distance")
## Run 0 stress 0.2415223
## Run 1 stress 0.2389663
## ... New best solution
## ... Procrustes: rmse 0.100488 max resid 0.5750439
## Run 2 stress 0.2379415
## ... New best solution
## ... Procrustes: rmse 0.05456371 max resid 0.2427361
## Run 3 stress 0.2379416
## ... Procrustes: rmse 9.343905e-05 max resid 0.0004233384
## ... Similar to previous best
## Run 4 stress 0.2379415
## ... Procrustes: rmse 0.0001734942 max resid 0.0007668742
## ... Similar to previous best
## Run 5 stress 0.2421108
## Run 6 stress 0.2379415
## ... Procrustes: rmse 1.574039e-05 max resid 9.149001e-05
## ... Similar to previous best
## Run 7 stress 0.2382823
## ... Procrustes: rmse 0.0483783 max resid 0.2352576
## Run 8 stress 0.2379415
## ... Procrustes: rmse 3.071133e-05 max resid 7.968857e-05
## ... Similar to previous best
## Run 9 stress 0.2421101
## Run 10 stress 0.2442064
## Run 11 stress 0.2415224
## Run 12 stress 0.2614215
## Run 13 stress 0.2434778
## Run 14 stress 0.2400955
## Run 15 stress 0.2722964
## Run 16 stress 0.238213
## ... Procrustes: rmse 0.04154184 max resid 0.2394212
## Run 17 stress 0.2685232
## Run 18 stress 0.2389663
## Run 19 stress 0.2415223
## Run 20 stress 0.2400955
## *** Best solution repeated 4 times
mean(XITS$beta_asv$BCdist)
## [1] 0.9592109
median(XITS$beta_asv$BCdist)
## [1] 0.9886755
as.matrix(XITS$beta_asv$BCdist)["A2T","A2T1"]
## [1] 0.432957
as.matrix(XITS$beta_asv$BCdist)["A3T","A3T1"]
## [1] 0.2553679
as.matrix(XITS$beta_asv$BCdist)["A4T","A4T1"]
## [1] 0.2558528
as.matrix(XITS$beta_asv$BCdist)["A6T","A6T1"]
## [1] 0.2464886
as.matrix(XITS$beta_asv$BCdist)["T10T","T10T1"]
## [1] 0.4866073
as.matrix(XITS$beta_asv$BCdist)["T12T","T12T1"]
## [1] 0.523499
as.matrix(XITS$beta_asv$BCdist)["O9T","O9T1"]
## [1] 0.3184464
as.matrix(XITS$beta_asv$BCdist)["FU3T","FU3T1"]
## [1] 0.4055995
# see also:
plot(XITS$mocks[1,],XITS$mocks[2,],log="xy")
## Warning in xy.coords(x, y, xlabel, ylabel, log): 4796 x values <= 0 omitted
## from logarithmic plot
## Warning in xy.coords(x, y, xlabel, ylabel, log): 4795 y values <= 0 omitted
## from logarithmic plot
For 16S, it looks less nice. I’m sure the culprit is the outliers A5T and T3T (see further). Try running it again without them.
X16S$beta_asv$BCdist <- vegan::vegdist(X16S$X, method = "bray")
FUNS$betaplot(X16S$beta_asv$BCdist,2,"16S: NMDS based on Bray Curtis distance")
## Run 0 stress 0.2060517
## Run 1 stress 0.1643125
## ... New best solution
## ... Procrustes: rmse 0.09496938 max resid 0.4188215
## Run 2 stress 0.1671694
## Run 3 stress 0.2148475
## Run 4 stress 0.1671446
## Run 5 stress 0.165424
## Run 6 stress 0.2139795
## Run 7 stress 0.1711995
## Run 8 stress 0.1914279
## Run 9 stress 0.2088918
## Run 10 stress 0.1723366
## Run 11 stress 0.211506
## Run 12 stress 0.1921776
## Run 13 stress 0.1643056
## ... New best solution
## ... Procrustes: rmse 0.001073852 max resid 0.005743202
## ... Similar to previous best
## Run 14 stress 0.1712615
## Run 15 stress 0.1721679
## Run 16 stress 0.2175024
## Run 17 stress 0.1712155
## Run 18 stress 0.1723369
## Run 19 stress 0.2070686
## Run 20 stress 0.1723368
## *** Best solution repeated 1 times
mean(X16S$beta_asv$BCdist)
## [1] 0.9351964
median(X16S$beta_asv$BCdist)
## [1] 0.9656546
as.matrix(X16S$beta_asv$BCdist)["A2T","A2T1"]
## [1] 0.612074
as.matrix(X16S$beta_asv$BCdist)["A3T","A3T1"]
## [1] 0.4490529
as.matrix(X16S$beta_asv$BCdist)["A4T","A4T1"]
## [1] 0.471351
as.matrix(X16S$beta_asv$BCdist)["A6T","A6T1"]
## [1] 0.5988588
as.matrix(X16S$beta_asv$BCdist)["T10T","T10T1"]
## [1] 0.4246175
as.matrix(X16S$beta_asv$BCdist)["T12T","T12T1"]
## [1] 0.6122163
as.matrix(X16S$beta_asv$BCdist)["O9T","O9T1"]
## [1] 0.5738448
as.matrix(X16S$beta_asv$BCdist)["FU3T","FU3T1"]
## [1] 0.4506166
# see also:
plot(X16S$mocks[1,],X16S$mocks[2,],log="xy")
## Warning in xy.coords(x, y, xlabel, ylabel, log): 9806 x values <= 0 omitted
## from logarithmic plot
## Warning in xy.coords(x, y, xlabel, ylabel, log): 9808 y values <= 0 omitted
## from logarithmic plot
# So let's now collapse/sum duplicates:
XITS$x <- FUNS$collapse_duplicates(XITS$X)
X16S$x <- FUNS$collapse_duplicates(X16S$X)
# Remove outliers:
X16S$x <- X16S$x[!c(rownames(X16S$x) %in% c("A5T","T3T")),]
# Same for X_otu97.5
XITS$x_otu97.5 <- FUNS$collapse_duplicates(XITS$X_otu97.5)
X16S$x_otu97.5 <- FUNS$collapse_duplicates(X16S$X_otu97.5)
X16S$x_otu97.5 <- X16S$x_otu97.5[!c(rownames(X16S$x_otu97.5) %in% c("A5T","T3T")),]
Now we can move on to the proper alpha and beta diversity analyses!
We start with richness, choosing the simple measure of the total number of taxa.
XITS$alpha_otu97.5 <- list()
X16S$alpha_otu97.5 <- list()
XITS$alpha_asv <- list()
X16S$alpha_asv <- list()
XITS$alpha_otu97.5$alpha <- cbind.data.frame(microbiome::alpha(t(XITS$x_otu97.5)),"total"=rowSums(XITS$x_otu97.5))
X16S$alpha_otu97.5$alpha <- cbind.data.frame(microbiome::alpha(t(X16S$x_otu97.5)),"total"=rowSums(X16S$x_otu97.5))
XITS$alpha_asv$alpha <- cbind.data.frame(microbiome::alpha(t(XITS$x)),"total"=rowSums(XITS$x))
X16S$alpha_asv$alpha <- cbind.data.frame(microbiome::alpha(t(X16S$x)),"total"=rowSums(X16S$x))
Next, once again, we visualize potential biases with scatterplots. For the correlations between sequencing depth and alpha diversity measures, I use a red colour in case the correlation is significant (p<0.05).
# Relationship between OTU's and ASV's for ITS:
par(mfrow=c(6,4))
for (i in c(1:17,19:22))
{
plot(XITS$alpha_otu97.5$alpha[,i],XITS$alpha_asv$alpha[,i],main=paste("ITS:",colnames(XITS$alpha_otu97.5$alpha)[i]),xlab="OTU",ylab="ASV")
}
# Correlation with sequencing depth for ITS:
par(mfrow=c(6,4))
for (i in c(1:17,19:22))
{
p <- cor.test(XITS$alpha_otu97.5$alpha[,i],XITS$alpha_otu97.5$alpha[,23])$p.value
r <- cor(XITS$alpha_otu97.5$alpha[,i],XITS$alpha_otu97.5$alpha[,23])
corcol <- ifelse(p<0.05,yes = "darkred", no ="black")
plot(XITS$alpha_otu97.5$alpha[,i],XITS$alpha_otu97.5$alpha[,23],main=paste("ITS:", colnames(XITS$alpha_otu97.5$alpha)[i]),xlab="alpha diversity estimate", ylab= "sequencing depth",col=corcol)
}
# Relationship between OTU's and ASV's for 16S:
par(mfrow=c(6,4))
for (i in c(1:17,19:22))
{
plot(X16S$alpha_otu97.5$alpha[,i],X16S$alpha_asv$alpha[,i],main=paste("16S:", colnames(X16S$alpha_otu97.5$alpha)[i]),xlab="OTU",ylab="ASV")
}
# Correlation with sequencing depth for 16S:
par(mfrow=c(6,4))
for (i in c(1:17,19:22))
{
p <- cor.test(X16S$alpha_otu97.5$alpha[,i],X16S$alpha_otu97.5$alpha[,23])$p.value
r <- cor(X16S$alpha_otu97.5$alpha[,i],X16S$alpha_otu97.5$alpha[,23])
corcol <- ifelse(p<0.05,yes = "darkred", no ="black")
plot(X16S$alpha_otu97.5$alpha[,i],X16S$alpha_otu97.5$alpha[,23],main=paste("16S:", colnames(X16S$alpha_otu97.5$alpha)[i]),xlab="alpha diversity estimate", ylab= "sequencing depth",col=corcol)
}
# start with classical richness estimates:
XITS$alpha_otu97.5$richness <- microbiome::richness(t(XITS$x_otu97.5),index="observed")
X16S$alpha_otu97.5$richness <- microbiome::richness(t(X16S$x_otu97.5),index="observed")
rich_all <- merge(XITS$alpha_otu97.5$richness,X16S$alpha_otu97.5$richness,by="row.names",suffixes=c("ITS","16S"))
# No meaningfull correlation between richness of ITS or 16S samples
pairs(rich_all[,-1])
par(mfrow=c(2,3))
plot(XITS$alpha_otu97.5$richness[,1] ~ log10(META[rownames(XITS$alpha_otu97.5$richness),"Cs_137"]),col=sitecolors[as.numeric(as.factor(META[rownames(XITS$alpha_otu97.5$richness),"site"]))],pch=16,cex=2,main=paste("ITS", colnames(XITS$alpha_otu97.5$richness)[1]),ylab=colnames(XITS$alpha_otu97.5$richness)[1],xlab="log10 137 Cs")
plot(X16S$alpha_otu97.5$richness[,1] ~ log10(META[rownames(X16S$alpha_otu97.5$richness),"Cs_137"]),col=sitecolors[as.numeric(as.factor(META[rownames(X16S$alpha_otu97.5$richness),"site"]))],pch=16,cex=2,main=paste("16S", colnames(X16S$alpha_otu97.5$richness)[1]),ylab=colnames(X16S$alpha_otu97.5$richness)[1],xlab="log10 137 Cs")
rm(rich_all)
We move on to the five diversity measures from the microbiome diversity function. Zero counts are included here, but excluding them did not change any of the conclusions (it did not make a big difference in the measures anyhow). Some things I noticed:
# start with classical diversity estimates:
XITS$alpha_otu97.5$diversity <- microbiome::diversity(t(XITS$x_otu97.5))
X16S$alpha_otu97.5$diversity <- microbiome::diversity(t(X16S$x_otu97.5))
div_all <- merge(XITS$alpha_otu97.5$diversity,X16S$alpha_otu97.5$diversity,by="row.names",suffixes=c("ITS","16S"))
# First diversity plots for ITS:
par(mfrow=c(2,3))
for(i in 1:5)
{
plot(XITS$alpha_otu97.5$diversity[,i] ~ log10(META[rownames(XITS$alpha_otu97.5$diversity),"Cs_137"]),col=sitecolors[as.numeric(as.factor(META[rownames(XITS$alpha_otu97.5$diversity),"site"]))],pch=16,cex=2,main=paste("ITS", colnames(XITS$alpha_otu97.5$diversity)[i]),ylab=colnames(XITS$alpha_otu97.5$diversity)[i],xlab="log10 137 Cs")
}
# Same for bacteria
par(mfrow=c(2,3))
for(i in 1:5)
{
plot(X16S$alpha_otu97.5$diversity[,i] ~ log10(META[rownames(X16S$alpha_otu97.5$diversity),"Cs_137"]),col=sitecolors[as.numeric(as.factor(META[rownames(X16S$alpha_otu97.5$diversity),"site"]))],pch=16,cex=2,main=paste("16S", colnames(X16S$alpha_otu97.5$diversity)[i]),ylab=colnames(X16S$alpha_otu97.5$diversity)[i],xlab="log10 137 Cs")
}
pairs(div_all[,-1])
# Inverse Simpson: p-value = 0.001955
#cor.test(div_all[,2],div_all[,7])
# Gini Simpson: p-value = 0.7551
#cor.test(div_all[,3],div_all[,8])
# Shannon: p-value = 0.02814
#cor.test(div_all[,4],div_all[,9])
# Fisher: p-value = 0.844
#cor.test(div_all[,5],div_all[,10])
# Coverage: p-value = 0.001783
#cor.test(div_all[,6],div_all[,11])
rm(div_all)
Similarly, very few trends. Only a super weak negative correlation at the level of Simpson evenness
# start with classical evenness estimates:
XITS$alpha_otu97.5$evenness <- microbiome::evenness(t(XITS$x_otu97.5))
X16S$alpha_otu97.5$evenness <- microbiome::evenness(t(X16S$x_otu97.5))
# First evenness plots for ITS:
eve_all <- merge(XITS$alpha_otu97.5$evenness,X16S$alpha_otu97.5$evenness,by="row.names",suffixes=c("ITS","16S"))
par(mfrow=c(2,3))
for(i in 1:5)
{
plot(XITS$alpha_otu97.5$evenness[,i] ~ log10(META[rownames(XITS$alpha_otu97.5$evenness),"Cs_137"]),col=sitecolors[as.numeric(as.factor(META[rownames(XITS$alpha_otu97.5$evenness),"site"]))],pch=16,cex=2,main=paste("ITS", colnames(XITS$alpha_otu97.5$evenness)[i]),ylab=colnames(XITS$alpha_otu97.5$evenness)[i],xlab="log10 137 Cs")
}
# Same for bacteria
par(mfrow=c(2,3))
for(i in 1:5)
{
plot(X16S$alpha_otu97.5$evenness[,i] ~ log10(META[rownames(X16S$alpha_otu97.5$evenness),"Cs_137"]),col=sitecolors[as.numeric(as.factor(META[rownames(X16S$alpha_otu97.5$evenness),"site"]))],pch=16,cex=2,main=paste("16S", colnames(X16S$alpha_otu97.5$evenness)[i]),ylab=colnames(X16S$alpha_otu97.5$evenness)[i],xlab="log10 137 Cs")
}
pairs(eve_all[,-1])
#cor.test(eve_all[,2],eve_all[,7])
#cor.test(eve_all[,3],eve_all[,8])
cor.test(eve_all[,4],eve_all[,9]) # Simpson evenness
##
## Pearson's product-moment correlation
##
## data: eve_all[, 4] and eve_all[, 9]
## t = -1.3485, df = 33, p-value = 0.1867
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.5220266 0.1133414
## sample estimates:
## cor
## -0.2285372
#cor.test(eve_all[,5],eve_all[,10])
#cor.test(eve_all[,6],eve_all[,11])
rm(eve_all)
# start with classical dominance estimates:
XITS$alpha_otu97.5$dominance <- microbiome::dominance(t(XITS$x_otu97.5))[,-1] # exclude dbp, which is essentially the same as "relative dominance"
X16S$alpha_otu97.5$dominance <- microbiome::dominance(t(X16S$x_otu97.5))[,-1]
dom_all <- merge(XITS$alpha_otu97.5$dominance,X16S$alpha_otu97.5$dominance,by="row.names",suffixes=c("ITS","16S"))
# First dominance plots for ITS:
par(mfrow=c(2,3))
for(i in 1:6)
{
plot(XITS$alpha_otu97.5$dominance[,i] ~ log10(META[rownames(XITS$alpha_otu97.5$dominance),"Cs_137"]),col=sitecolors[as.numeric(as.factor(META[rownames(XITS$alpha_otu97.5$dominance),"site"]))],pch=16,cex=2,main=paste("ITS", colnames(XITS$alpha_otu97.5$dominance)[i]),ylab=colnames(XITS$alpha_otu97.5$dominance)[i],xlab="log10 137 Cs")
}
# Same for bacteria
par(mfrow=c(2,3))
for(i in 1:6)
{
plot(X16S$alpha_otu97.5$dominance[,i] ~ log10(META[rownames(X16S$alpha_otu97.5$dominance),"Cs_137"]),col=sitecolors[as.numeric(as.factor(META[rownames(X16S$alpha_otu97.5$dominance),"site"]))],pch=16,cex=2,main=paste("16S", colnames(X16S$alpha_otu97.5$dominance)[i]),ylab=colnames(X16S$alpha_otu97.5$dominance)[i],xlab="log10 137 Cs")
}
pairs(dom_all[,-1])
#cor.test(dom_all[,2],dom_all[,8])
#cor.test(dom_all[,3],dom_all[,9])
#cor.test(dom_all[,4],dom_all[,10])
#cor.test(dom_all[,5],dom_all[,11])
#cor.test(dom_all[,6],dom_all[,12])
#cor.test(dom_all[,7],dom_all[,13])
rm(dom_all)
And finally rarity. No interesting trends or correlations here either.
# start with classical rarity estimates:
XITS$alpha_otu97.5$rarity <- microbiome::rarity(t(XITS$x_otu97.5))
X16S$alpha_otu97.5$rarity <- microbiome::rarity(t(X16S$x_otu97.5))
rar_all <- merge(XITS$alpha_otu97.5$rarity,X16S$alpha_otu97.5$rarity,by="row.names",suffixes=c("ITS","16S"))
par(mfrow=c(2,3))
for(i in 1:3)
{
plot(XITS$alpha_otu97.5$rarity[,i] ~ log10(META[rownames(XITS$alpha_otu97.5$rarity),"Cs_137"]),col=sitecolors[as.numeric(as.factor(META[rownames(XITS$alpha_otu97.5$rarity),"site"]))],pch=16,cex=2,main=paste("ITS", colnames(XITS$alpha_otu97.5$rarity)[i]),ylab=colnames(XITS$alpha_otu97.5$rarity)[i],xlab="log10 137 Cs")
}
# Same for bacteria
for(i in 1:3)
{
plot(X16S$alpha_otu97.5$rarity[,i] ~ log10(META[rownames(X16S$alpha_otu97.5$rarity),"Cs_137"]),col=sitecolors[as.numeric(as.factor(META[rownames(X16S$alpha_otu97.5$rarity),"site"]))],pch=16,cex=2,main=paste("16S", colnames(X16S$alpha_otu97.5$rarity)[i]),ylab=colnames(X16S$alpha_otu97.5$rarity)[i],xlab="log10 137 Cs")
}
pairs(rar_all[,-1])
#cor.test(rar_all[,2],rar_all[,5])
#cor.test(rar_all[,3],rar_all[,6])
#cor.test(rar_all[,4],rar_all[,7])
rm(rar_all)
We will look into some classical beta diversity measures based on OTU’s, and next we will move on to unifrac-based distance methods, working with the ASV’s.
Visualisation of beta diversity will be with NMDS. Nasrin: first I visualized using MDS (PCoA), but going through literature I find more examples where they use nonmetric. Can you please double check which one is preferred and why?
First a Bray Curtis distance based on the OTUS. Notice that the plots are NMDS. They separate between sites extremely well for ITS. This needs more reflection because it’s super simple and by far the nicest beta diversity plot. We could also consider doing the same with a 95% identity cutoff for OTU’s, which should be entirely feasible with the functions defined above.
XITS$beta_otu97.5$BCdist <- vegan::vegdist(XITS$x_otu97.5, method = "bray")
FUNS$betaplot(XITS$beta_otu97.5$BCdist,2,"ITS: NMDS based on Bray Curtis distance")
## Run 0 stress 0.2195616
## Run 1 stress 0.21654
## ... New best solution
## ... Procrustes: rmse 0.05236158 max resid 0.1985932
## Run 2 stress 0.2170456
## Run 3 stress 0.2228401
## Run 4 stress 0.2216314
## Run 5 stress 0.2195616
## Run 6 stress 0.2161677
## ... New best solution
## ... Procrustes: rmse 0.01382243 max resid 0.05653742
## Run 7 stress 0.2226497
## Run 8 stress 0.2217721
## Run 9 stress 0.2194505
## Run 10 stress 0.2170722
## Run 11 stress 0.2168354
## Run 12 stress 0.2173403
## Run 13 stress 0.2166804
## Run 14 stress 0.2165399
## ... Procrustes: rmse 0.01378591 max resid 0.05662984
## Run 15 stress 0.2161854
## ... Procrustes: rmse 0.005813229 max resid 0.02668189
## Run 16 stress 0.2392967
## Run 17 stress 0.2457943
## Run 18 stress 0.2219012
## Run 19 stress 0.2234187
## Run 20 stress 0.2170724
## Run 21 stress 0.2173403
## Run 22 stress 0.2168353
## Run 23 stress 0.222221
## Run 24 stress 0.2214787
## Run 25 stress 0.2229275
## Run 26 stress 0.2168354
## Run 27 stress 0.2438068
## Run 28 stress 0.2173404
## Run 29 stress 0.2215053
## Run 30 stress 0.2166805
## Run 31 stress 0.2537598
## Run 32 stress 0.2217149
## Run 33 stress 0.2168354
## Run 34 stress 0.2167716
## Run 35 stress 0.2168356
## Run 36 stress 0.2165164
## ... Procrustes: rmse 0.01289329 max resid 0.05773489
## Run 37 stress 0.2165399
## ... Procrustes: rmse 0.01377609 max resid 0.0566067
## Run 38 stress 0.2225113
## Run 39 stress 0.2218663
## Run 40 stress 0.2213901
## Run 41 stress 0.245587
## Run 42 stress 0.2170726
## Run 43 stress 0.2476566
## Run 44 stress 0.221913
## Run 45 stress 0.2218941
## Run 46 stress 0.2216327
## Run 47 stress 0.2168354
## Run 48 stress 0.2165164
## ... Procrustes: rmse 0.01290842 max resid 0.05778355
## Run 49 stress 0.2216452
## Run 50 stress 0.2168354
## Run 51 stress 0.2218901
## Run 52 stress 0.2168358
## Run 53 stress 0.2165399
## ... Procrustes: rmse 0.01378736 max resid 0.05664269
## Run 54 stress 0.2170733
## Run 55 stress 0.2170452
## Run 56 stress 0.2217721
## Run 57 stress 0.2195844
## Run 58 stress 0.2165399
## ... Procrustes: rmse 0.01378502 max resid 0.05664394
## Run 59 stress 0.2168357
## Run 60 stress 0.221922
## Run 61 stress 0.2170721
## Run 62 stress 0.2168354
## Run 63 stress 0.2161851
## ... Procrustes: rmse 0.005744833 max resid 0.02655988
## Run 64 stress 0.2451963
## Run 65 stress 0.2168354
## Run 66 stress 0.2214647
## Run 67 stress 0.2168356
## Run 68 stress 0.2229272
## Run 69 stress 0.2482602
## Run 70 stress 0.2168354
## Run 71 stress 0.2218902
## Run 72 stress 0.2214787
## Run 73 stress 0.2168355
## Run 74 stress 0.2173403
## Run 75 stress 0.2228191
## Run 76 stress 0.2161851
## ... Procrustes: rmse 0.005700706 max resid 0.02646405
## Run 77 stress 0.2215941
## Run 78 stress 0.2274349
## Run 79 stress 0.2214786
## Run 80 stress 0.2170732
## Run 81 stress 0.2168487
## Run 82 stress 0.2228843
## Run 83 stress 0.219617
## Run 84 stress 0.2187805
## Run 85 stress 0.2229272
## Run 86 stress 0.2218607
## Run 87 stress 0.3956451
## Run 88 stress 0.2229273
## Run 89 stress 0.2173403
## Run 90 stress 0.2174479
## Run 91 stress 0.2161676
## ... New best solution
## ... Procrustes: rmse 0.0001113134 max resid 0.0005116164
## ... Similar to previous best
## *** Best solution repeated 1 times
For 16S, it looks less nice. I’m sure the culprit is the outliers A5T and T3T (see further). Try running it again without them.
X16S$beta_otu97.5$BCdist <- vegan::vegdist(X16S$x_otu97.5, method = "bray")
FUNS$betaplot(X16S$beta_otu97.5$BCdist,2,"16S: NMDS based on Bray Curtis distance")
## Run 0 stress 0.1511107
## Run 1 stress 0.1925715
## Run 2 stress 0.1511186
## ... Procrustes: rmse 0.005389597 max resid 0.02129102
## Run 3 stress 0.1511186
## ... Procrustes: rmse 0.005390565 max resid 0.02127708
## Run 4 stress 0.1951323
## Run 5 stress 0.15109
## ... New best solution
## ... Procrustes: rmse 0.002773529 max resid 0.01376874
## Run 6 stress 0.15109
## ... New best solution
## ... Procrustes: rmse 5.34065e-06 max resid 1.417754e-05
## ... Similar to previous best
## Run 7 stress 0.15109
## ... Procrustes: rmse 2.837567e-06 max resid 8.868053e-06
## ... Similar to previous best
## Run 8 stress 0.1511186
## ... Procrustes: rmse 0.004501855 max resid 0.02082956
## Run 9 stress 0.194051
## Run 10 stress 0.1809248
## Run 11 stress 0.2310836
## Run 12 stress 0.2010865
## Run 13 stress 0.1874341
## Run 14 stress 0.2097055
## Run 15 stress 0.2010865
## Run 16 stress 0.1511107
## ... Procrustes: rmse 0.002775317 max resid 0.01378201
## Run 17 stress 0.1511495
## ... Procrustes: rmse 0.005117323 max resid 0.02039434
## Run 18 stress 0.15109
## ... Procrustes: rmse 3.22082e-06 max resid 1.606723e-05
## ... Similar to previous best
## Run 19 stress 0.1809248
## Run 20 stress 0.2010865
## *** Best solution repeated 3 times
Next, we generate distance matrices and neighbour joining trees for UniFrac. I have put the entire thing in one function, with two inputs: the dataset (XITS or X16S), and the k value. It calculates 3 kinds of distance matrix based on k-mer distance (euclidean, cosine distance and edgar). Next, it generates neighbour joining trees. With these trees and the untransformed count matrix, it calculates weighted and unweighted unifrac.
Caution: this function was still written for an object that would contain a single count table called “X”. It performs OK still, as long as the order of features (columns) is unchaged between “X” (the original count matrix) and “x” (the count matrix with merged samples and outliers removed)
XITS <- FUNS$generate_beta_asv(XITS, 5)
XITS <- FUNS$generate_beta_asv(XITS, 7)
X16S <- FUNS$generate_beta_asv(X16S, 5)
X16S <- FUNS$generate_beta_asv(X16S, 7)
FUNS$betaplot(XITS$beta_asv$k5$cos$WUF,2,"ITS: NMDS based on weighted unifrac")
## Run 0 stress 0.1717321
## Run 1 stress 0.1717321
## ... New best solution
## ... Procrustes: rmse 0.001041446 max resid 0.005446282
## ... Similar to previous best
## Run 2 stress 0.1752452
## Run 3 stress 0.2441242
## Run 4 stress 0.2342796
## Run 5 stress 0.1717316
## ... New best solution
## ... Procrustes: rmse 0.0006298701 max resid 0.003295059
## ... Similar to previous best
## Run 6 stress 0.1717248
## ... New best solution
## ... Procrustes: rmse 0.01084488 max resid 0.05236991
## Run 7 stress 0.1717322
## ... Procrustes: rmse 0.01087074 max resid 0.05213562
## Run 8 stress 0.1717321
## ... Procrustes: rmse 0.01083825 max resid 0.05261946
## Run 9 stress 0.1717321
## ... Procrustes: rmse 0.01086832 max resid 0.05215046
## Run 10 stress 0.238092
## Run 11 stress 0.1717244
## ... New best solution
## ... Procrustes: rmse 0.001119744 max resid 0.005211145
## ... Similar to previous best
## Run 12 stress 0.2542908
## Run 13 stress 0.1717242
## ... New best solution
## ... Procrustes: rmse 0.0003050361 max resid 0.001292722
## ... Similar to previous best
## Run 14 stress 0.1991133
## Run 15 stress 0.171724
## ... New best solution
## ... Procrustes: rmse 0.0003126149 max resid 0.001581896
## ... Similar to previous best
## Run 16 stress 0.1717244
## ... Procrustes: rmse 0.000437372 max resid 0.0019257
## ... Similar to previous best
## Run 17 stress 0.1717317
## ... Procrustes: rmse 0.01062679 max resid 0.05050413
## Run 18 stress 0.1717243
## ... Procrustes: rmse 0.0004433173 max resid 0.002265181
## ... Similar to previous best
## Run 19 stress 0.2066635
## Run 20 stress 0.1717319
## ... Procrustes: rmse 0.010646 max resid 0.05041645
## *** Best solution repeated 3 times
FUNS$betaplot(XITS$beta_asv$k5$cos$UUF,2,"ITS: NMDS based on unweighted unifrac")
## Run 0 stress 0.2073058
## Run 1 stress 0.2061372
## ... New best solution
## ... Procrustes: rmse 0.03470784 max resid 0.1348326
## Run 2 stress 0.21436
## Run 3 stress 0.2061366
## ... New best solution
## ... Procrustes: rmse 0.0003729544 max resid 0.001863726
## ... Similar to previous best
## Run 4 stress 0.2065002
## ... Procrustes: rmse 0.01405123 max resid 0.05717469
## Run 5 stress 0.206137
## ... Procrustes: rmse 0.0007016308 max resid 0.00353124
## ... Similar to previous best
## Run 6 stress 0.2070166
## Run 7 stress 0.2338645
## Run 8 stress 0.2061371
## ... Procrustes: rmse 0.0003158124 max resid 0.001569558
## ... Similar to previous best
## Run 9 stress 0.2338459
## Run 10 stress 0.2581449
## Run 11 stress 0.2338351
## Run 12 stress 0.2061377
## ... Procrustes: rmse 0.001014854 max resid 0.005110311
## ... Similar to previous best
## Run 13 stress 0.2388458
## Run 14 stress 0.2065005
## ... Procrustes: rmse 0.0141096 max resid 0.0572858
## Run 15 stress 0.2061375
## ... Procrustes: rmse 0.0005101487 max resid 0.002560743
## ... Similar to previous best
## Run 16 stress 0.2061369
## ... Procrustes: rmse 0.0002474042 max resid 0.001245933
## ... Similar to previous best
## Run 17 stress 0.2399048
## Run 18 stress 0.2064996
## ... Procrustes: rmse 0.01366996 max resid 0.05644463
## Run 19 stress 0.2338631
## Run 20 stress 0.2063334
## ... Procrustes: rmse 0.00496518 max resid 0.02146803
## *** Best solution repeated 6 times
FUNS$betaplot(X16S$beta_asv$k5$cos$WUF,2,"16S: NMDS based on weighted unifrac")
## Run 0 stress 0.1411075
## Run 1 stress 0.1411075
## ... Procrustes: rmse 3.01245e-05 max resid 0.000101905
## ... Similar to previous best
## Run 2 stress 0.2030164
## Run 3 stress 0.1411075
## ... New best solution
## ... Procrustes: rmse 3.635883e-06 max resid 1.300527e-05
## ... Similar to previous best
## Run 4 stress 0.1448017
## Run 5 stress 0.1411075
## ... Procrustes: rmse 8.207412e-06 max resid 4.288476e-05
## ... Similar to previous best
## Run 6 stress 0.2030164
## Run 7 stress 0.2034605
## Run 8 stress 0.1411075
## ... Procrustes: rmse 1.373261e-05 max resid 4.121247e-05
## ... Similar to previous best
## Run 9 stress 0.191438
## Run 10 stress 0.1454371
## Run 11 stress 0.2172065
## Run 12 stress 0.238678
## Run 13 stress 0.1447925
## Run 14 stress 0.1411075
## ... Procrustes: rmse 2.009719e-05 max resid 5.942629e-05
## ... Similar to previous best
## Run 15 stress 0.1411075
## ... Procrustes: rmse 4.34741e-06 max resid 1.274688e-05
## ... Similar to previous best
## Run 16 stress 0.2186722
## Run 17 stress 0.1454371
## Run 18 stress 0.2186705
## Run 19 stress 0.1411075
## ... Procrustes: rmse 6.099202e-06 max resid 1.808999e-05
## ... Similar to previous best
## Run 20 stress 0.1454371
## *** Best solution repeated 6 times
FUNS$betaplot(X16S$beta_asv$k5$cos$UUF,2,"16S: NMDS based on unweighted unifrac")
## Run 0 stress 0.1384554
## Run 1 stress 0.1581625
## Run 2 stress 0.1709945
## Run 3 stress 0.1581617
## Run 4 stress 0.1403704
## Run 5 stress 0.1581617
## Run 6 stress 0.1581617
## Run 7 stress 0.2166407
## Run 8 stress 0.1928733
## Run 9 stress 0.1637117
## Run 10 stress 0.1384987
## ... Procrustes: rmse 0.01945306 max resid 0.0572303
## Run 11 stress 0.1404465
## Run 12 stress 0.1637118
## Run 13 stress 0.1926978
## Run 14 stress 0.1637118
## Run 15 stress 0.1922051
## Run 16 stress 0.1403703
## Run 17 stress 0.1385792
## ... Procrustes: rmse 0.01974085 max resid 0.05701214
## Run 18 stress 0.2156991
## Run 19 stress 0.2248407
## Run 20 stress 0.1581617
## Run 21 stress 0.1709944
## Run 22 stress 0.1384987
## ... Procrustes: rmse 0.01949274 max resid 0.05729367
## Run 23 stress 0.1637118
## Run 24 stress 0.1637118
## Run 25 stress 0.1384554
## ... New best solution
## ... Procrustes: rmse 0.0002098508 max resid 0.001129049
## ... Similar to previous best
## *** Best solution repeated 1 times
It becomes immediately clear that the 16S unifrac looks very messy. This is all due to the outliers (see earlier) and I’m pretty sure we should repeat the calculations without these datapoints. I was first thinking just removing the two datapoints from the unifrac outcomes, but NJ trees are used to infer evolutionary relationships based on distance data. Outliers can disproportionately influence the tree structure by creating long branches that can attract or repel other branches (long branch attraction). So the outliers would not only affect their own placement in the tree but also alter the topology of the tree regarding other samples. I guess this could distort the inferred relationships among the more typical samples.
I will recalculate the whole thing ditching the two outliers.
Note that these plots are just a selection for illustration for now. You can generate many more plots and they should be explored: * NMDS with more than 2 dimensions (but keep on checking the stress) * You can consider going back to PCoA (see earlier remarks) * Unifrac is based on trees, which are constructed in different manners (different k values and different distance metrics). You can go through them.
And finally the taxonomy part. Nothing much to say here for now, because it still needs closer inspection, but everything we need should be there. First, we assign taxonomy both for ITS and 16S:
### Assign taxonomy:
XITS$taxa <- assignTaxonomy(colnames(XITS$x), "~/db/sh_general_release_dynamic_s_all_04.04.2024.fasta",verbose=T,multithread=20)
X16S$taxa <- assignTaxonomy(colnames(X16S$x), "~/db/silva_nr99_v138.1_wSpecies_train_set.fa.gz",verbose=T,multithread=20)
And some barplots on the class level as an illustration:
FUNS$barplot_function(XITS$x,XITS$taxa,"Class",samples=c(1:35),other=12)
FUNS$barplot_function(X16S$x,X16S$taxa,"Class",samples=c(1:39),other=12)
knitr::knit_exit()